Examples for Third-Variable Effect Analysis

Data Organization and Identify Potential Mediators/Confounders

We use weight_behavior data as an example to show how to do a preliminary data analysis to identify potential MCs and covariates. The function data.org is used for this purpose. The list of arguments for data.org are defined in Yu and Li (2017). In the following example, the predictor, “pred,” is gender, and the outcome, “y,” is a binary indicator of overweight (1) or not (0). The reference group of the exposure variable is the male as defined by “predref=”M"“. There are 12 variables in”x“, which includes all potential MCs and covariates. As an argument in the function data.org,”mediator=5:12" indicates that the variables from column 5 to column 12 in x should be tested for potential MCs. Other columns in “x” are to be treated as covariates. The “mediator” argument can also be a character vector with names of the potential MCs in x. Two tests are needed to identify a potential MC: first, the variable is significantly related with the predictor adjusting for other covariates. The other covariates here is defined by “cova.” If cova is a data frame by itself, all covariates in cova are for all potential MCs. If cova is a list, the first item will be the covariate data frame and the second item list the names or column numbers of MCs in “x” that should adjust for the covariates. The significance level is set by “alpha2,” whose default value is 0.1. Second, the variable has to be significantly related with the outcome adjusting (testtype=1, by default) or not adjusting (testtype=2) for the predictor and other variables. The significance level is set by “alpha.” In the following example, p-value 1 shows the results for the second test. and p-value 2 are the results for the first test. A variable that pass the second test but not the first test is considered as covariates. Variables do not pass the second test are not adopted in further analysis. Variables in “x” but not in “mediator” are forced in further analysis as covariates.

“jointm” is a list with the first item identify the number of groups of MCs whose joint effects are of interesting, where each group is defined in an item of “jointm.” In the above example, the joint effect of Variables 5, 7, 9 is of interesting. These variables are forced into analysis as potential MCs. A variable can be shown in “jointm” multiple times. The joint effect of each group of variables will be reported. A trick to force in variables as potential MCs is to assign them in a separate group of “jointm.”

One binary/continuous predictor and one binary/continuous outcome

data("weight_behavior")
#binary predictor x
#binary y
 x=weight_behavior[,c(2,4:14)]
 pred=weight_behavior[,"sex"]
 y=weight_behavior[,"overweigh"]
 data.b.b.1<-data.org(x,y,mediator=5:12,jointm=list(n=1,j1=7:9),
                        pred=pred,predref="M", alpha=0.4,alpha2=0.4)
 summary(data.b.b.1)
#> Identified as mediators: 
#> [1] "tvhours"   "cmpthours" "cellhours" "exercises" "sweat"     "sports"   
#> Selected as covariates: 
#> [1] "age"       "race"      "numpeople" "car"      
#> Tests: 
#>             P-Value 1.y P-Value 2.pred
#> age               0.836             NA
#> race              0.850             NA
#> numpeople         0.561             NA
#> car               0.052             NA
#> gotosch           0.710             NA
#> snack             0.724             NA
#> tvhours -         0.830          0.484
#> cmpthours -       0.826          0.760
#> cellhours -       0.067          0.688
#> sports *          0.000          0.003
#> exercises *       0.176          0.203
#> sweat *           0.181          0.046
#> pred              0.178             NA
#> ----
#>  *:mediator,-:joint mediator
#>  P-Value 1:Type-3 tests in the full model (data.org) or estimated coefficients (data.org.big) 
#>  when testtype=1, univariate relationship test with the outcome when testtype=2
#>  P-Value 2:Tests of relationship with the Predictor

The results from “data.org” function is summarized as above.

A potential MC defined in “mediator” argument is identified as categorical or binary if it has only two unique values or if the variable is a vector factors or characters. By default, the reference group is the first level of the variable. If the reference group needs to be changed, the categorical variable should be defined in binmed or catmed, and then the corresponding reference group in binref and catref. Continuous MCs can be defined by contmed. The following example did the same analysis as above, only that it defines that potential continuous mediators are columns 7,8,9,11, and 12 of x, binary mediators are columns 6 and 10, and categorical mediator is column 5 of x with 1 to be the reference group for all categorical and binary mediators.

 data.b.b.2<-data.org(x,y,pred=pred,contmed=c(7:9,11:12),binmed=c(6,10), jointm=list(n=1,j1=7:9),
   binref=c(1,1),catmed=5,catref=1,predref="M",alpha=0.4,alpha2=0.4) 
 summary(data.b.b.2)

Survival outcome

The functions in mma can deal with binary, categorical, continuous, or time-to-event outcomes. If the outcome is time-to-event, it should be defined by the “Surv” function in the survival package. An example of application on breast cancer survival can be found in Yu et al. (2019).

 cgd1<-cgd0
 status<-ifelse(is.na(cgd1$etime1),0,1)
 y=Surv(cgd1$futime,status) 
 x=cgd1[,c(5:7,9:12)]
 pred=cgd1[,4]
 data.b.surv<-data.org(x,y,pred=pred,mediator=1:ncol(x),   
                    alpha=0.4,alpha2=0.4)
 summary(data.b.surv)

Multivariate predictors

In addition, the package can handle multivariate and multicategorical predictors. If the predictor is multicategorical of k levels, “data.org” first transform the predictor to k-1 binary predictors. If a variable significantly relates with any of the k-1 predictors, the variable passes the first test described above. P-value 2 is shown for each predictor. In the following example, the predictor is the race with six levels.

 #multivariate predictor
 x=weight_behavior[,c(2:3,5:14)]
 pred=weight_behavior[,4]
 y=weight_behavior[,15]
 data.mb.b <-data.org(x,y,mediator=5:12,jointm=list(n=1,j1=c(5,7,9)),
                      pred=pred,predref="OTHER", alpha=0.4,alpha2=0.4)
 summary(data.mb.b)
#> Identified as mediators: 
#> [1] "tvhours"   "cellhours" "sweat"     "sports"    "gotosch"  
#> Selected as covariates: 
#> [1] "age"       "sex"       "numpeople" "car"       "exercises"
#> Tests: 
#>               P-Value 1.y P-Value 2.pred P-Value 2.predAFRICAN
#> age                 0.836             NA                    NA
#> sex                 0.178             NA                    NA
#> numpeople           0.561             NA                    NA
#> car                 0.052             NA                    NA
#> gotosch -           0.710          0.112                 0.796
#> snack               0.724             NA                    NA
#> tvhours -           0.830          0.748                 0.535
#> cmpthours           0.826             NA                    NA
#> cellhours -         0.067          0.880                 0.994
#> sports *            0.000          0.587                 0.163
#> exercises           0.176          0.629                 0.731
#> sweat *             0.181          0.647                 0.174
#> pred                0.543             NA                    NA
#> predAFRICAN         0.663             NA                    NA
#> predCAUCASIAN       0.242             NA                    NA
#> predINDIAN          0.890             NA                    NA
#> predMIXED           0.782             NA                    NA
#>               P-Value 2.predCAUCASIAN P-Value 2.predINDIAN P-Value 2.predMIXED
#> age                                NA                   NA                  NA
#> sex                                NA                   NA                  NA
#> numpeople                          NA                   NA                  NA
#> car                                NA                   NA                  NA
#> gotosch -                       0.996                0.097               0.033
#> snack                              NA                   NA                  NA
#> tvhours -                       0.334                0.916               0.092
#> cmpthours                          NA                   NA                  NA
#> cellhours -                     0.707                0.555               0.383
#> sports *                        0.816                0.453               0.091
#> exercises                       0.723                0.912               0.881
#> sweat *                         0.214                0.374               0.203
#> pred                               NA                   NA                  NA
#> predAFRICAN                        NA                   NA                  NA
#> predCAUCASIAN                      NA                   NA                  NA
#> predINDIAN                         NA                   NA                  NA
#> predMIXED                          NA                   NA                  NA
#> ----
#>  *:mediator,-:joint mediator
#>  P-Value 1:Type-3 tests in the full model (data.org) or estimated coefficients (data.org.big) 
#>  when testtype=1, univariate relationship test with the outcome when testtype=2
#>  P-Value 2:Tests of relationship with the Predictor

Multivariate outcomes

Similarly, the package can deal with multivariate outcomes. The following code deals with multivariate predictors and multivariate responses. If the variable is significantly related with any one of the outcomes, it passes the second test described above. The results from “data.org” are summarized for each combination of the exposure-outcome relationship.

 #multivariate responses
 x=weight_behavior[,c(2:3,5:14)]
 pred=weight_behavior[,4]
 y=weight_behavior[,c(1,15)]
 data.mb.mb<-data.org(x,y,mediator=5:12,jointm=list(n=1,j1=c(5,7,9)),
                      pred=pred,predref="OTHER", alpha=0.4,alpha2=0.4)
 summary(data.mb.mb)
#> Identified as mediators: 
#> [1] "tvhours"   "cellhours" "sweat"     "sports"    "gotosch"  
#> Selected as covariates: 
#> [1] "age"       "sex"       "numpeople" "car"       "exercises"
#> Tests: 
#>               P-Value 1.bmi P-Value 1.overweigh P-Value 2.pred
#> age                   0.458               0.836             NA
#> sex                   0.003               0.178             NA
#> numpeople             0.282               0.561             NA
#> car                   0.059               0.052             NA
#> gotosch -             0.527               0.710          0.112
#> snack                 0.830               0.724             NA
#> tvhours -             0.505               0.830          0.748
#> cmpthours             0.676               0.826             NA
#> cellhours -           0.084               0.067          0.880
#> sports *              0.001               0.000          0.587
#> exercises             0.089               0.176          0.629
#> sweat *               0.078               0.181          0.647
#> pred                  0.344               0.543             NA
#> predAFRICAN           0.900               0.663             NA
#> predCAUCASIAN         0.032               0.242             NA
#> predINDIAN            0.446               0.890             NA
#> predMIXED             0.833               0.782             NA
#>               P-Value 2.predAFRICAN P-Value 2.predCAUCASIAN
#> age                              NA                      NA
#> sex                              NA                      NA
#> numpeople                        NA                      NA
#> car                              NA                      NA
#> gotosch -                     0.796                   0.996
#> snack                            NA                      NA
#> tvhours -                     0.535                   0.334
#> cmpthours                        NA                      NA
#> cellhours -                   0.994                   0.707
#> sports *                      0.163                   0.816
#> exercises                     0.731                   0.723
#> sweat *                       0.174                   0.214
#> pred                             NA                      NA
#> predAFRICAN                      NA                      NA
#> predCAUCASIAN                    NA                      NA
#> predINDIAN                       NA                      NA
#> predMIXED                        NA                      NA
#>               P-Value 2.predINDIAN P-Value 2.predMIXED
#> age                             NA                  NA
#> sex                             NA                  NA
#> numpeople                       NA                  NA
#> car                             NA                  NA
#> gotosch -                    0.097               0.033
#> snack                           NA                  NA
#> tvhours -                    0.916               0.092
#> cmpthours                       NA                  NA
#> cellhours -                  0.555               0.383
#> sports *                     0.453               0.091
#> exercises                    0.912               0.881
#> sweat *                      0.374               0.203
#> pred                            NA                  NA
#> predAFRICAN                     NA                  NA
#> predCAUCASIAN                   NA                  NA
#> predINDIAN                      NA                  NA
#> predMIXED                       NA                  NA
#> ----
#>  *:mediator,-:joint mediator
#>  P-Value 1:Type-3 tests in the full model (data.org) or estimated coefficients (data.org.big) 
#>  when testtype=1, univariate relationship test with the outcome when testtype=2
#>  P-Value 2:Tests of relationship with the Predictor

##Third-Variable Effect Analysis Next, we use med function to estimate the TVE using the results from data.org function. The default approach is using a generalized linear model to fit the final full model. If “nonlinear=TRUE”", multiple Additive Regression Trees (MART) will be used to fit the relationship between outcome(s) and other variables, and smoothing splines used to fit the relationship between the potential MC and exposure variables(s).

Binary predictor and binary outcome

 med.b.b.2<-med(data=data.b.b.2,n=2,nonlinear=TRUE)

Using the linear method and to show the results:

 med.b.b.1<-med(data=data.b.b.2,n=2)
 med.b.b.1
#> 
#> 
#> For the predictor pred :
#>  The estimated total effect:[1] 0.6475
#> 
#>  The estimated indirect effect:
#>       y1.all   y1.tvhours y1.cmpthours y1.cellhours y1.exercises     y1.sweat 
#>       0.2945       0.0072       0.0398       0.0843      -0.0218      -0.0114 
#>    y1.sports        y1.j1 
#>       0.1464       0.0768

We can plot the marginal effect of the MC on the outcome, and the marginal effect of the predictor on the MC. If the predictor is binary, plot.med draws a histogram or boxplot of the marginal density of the MC at each different value of the predictor.

plot(med.b.b.1, data.b.b.2,vari="exercises",xlim=c(0,50))
plot(med.b.b.2, data.b.b.2,vari="sports")

Survival outcome

For survival outcome, the default option is to fit the final full model using Cox proportional hazard model. Note that when use type=“lp,” the linear predictor is the type of predicted value as in predict.coxph, the results is similar to those from MART. Readers are referred to Yu et al. (2019) for more details.

 med.b.surv.1<-med(data=data.b.surv,n=2,type="lp") 
  #close to mart results when use type="lp"
 med.b.surv.2<-med(data=data.b.surv,n=2,nonlinear=TRUE)  
 #results in the linear part unit

Statistical Inference on TVE Analysis

boot.med function is used to make inferences on the TVE. Results return from data.org function is input as the first argument.

Binary predictor and binary outcome

 bootmed.b.b.1<-boot.med(data=data.b.b.2,n=2,n2=50)
#> [1] 1
#> [1] 2
#> [1] 3
#> [1] 4
#> [1] 5
#> [1] 6
#> [1] 7
#> [1] 8
#> [1] 9
#> [1] 10
#> [1] 11
#> [1] 12
#> [1] 13
#> [1] 14
#> [1] 15
#> [1] 16
#> [1] 17
#> [1] 18
#> [1] 19
#> [1] 20
#> [1] 21
#> [1] 22
#> [1] 23
#> [1] 24
#> [1] 25
#> [1] 26
#> [1] 27
#> [1] 28
#> [1] 29
#> [1] 30
#> [1] 31
#> [1] 32
#> [1] 33
#> [1] 34
#> [1] 35
#> [1] 36
#> [1] 37
#> [1] 38
#> [1] 39
#> [1] 40
#> [1] 41
#> [1] 42
#> [1] 43
#> [1] 44
#> [1] 45
#> [1] 46
#> [1] 47
#> [1] 48
#> [1] 49
#> [1] 50
 summary(bootmed.b.b.1)
#> MMA Analysis: Estimated Mediation Effects Using GLM
#> For Predictor/Moderator at pred 
#> $total.effect
#>    est   mean     sd   upbd   lwbd upbd_q lwbd_q upbd_b lwbd_b p_norm p_quan 
#>  0.603  0.644  0.375  1.379 -0.091  1.283 -0.031  1.529 -0.090  0.086  0.120 
#> 
#> $direct.effect
#>    est   mean     sd   upbd   lwbd upbd_q lwbd_q upbd_b lwbd_b p_norm p_quan 
#>  0.339  0.374  0.377  1.113 -0.364  1.071 -0.284  1.154 -0.400  0.320  0.320 
#> 
#> $indirect.effect
#>        y1.all y1.tvhours y1.cmpthours y1.cellhours y1.exercises y1.sweat
#> est     0.263     -0.003        0.001        0.026       -0.006    0.007
#> mean    0.270      0.004        0.042        0.015       -0.014    0.016
#> sd      0.149      0.052        0.073        0.062        0.028    0.032
#> upbd    0.562      0.105        0.185        0.138        0.042    0.078
#> lwbd   -0.023     -0.097       -0.102       -0.107       -0.070   -0.046
#> upbd_q  0.539      0.104        0.207        0.139        0.030    0.074
#> lwbd_q  0.003     -0.082       -0.077       -0.117       -0.078   -0.037
#> upbd_b  0.603      0.137        0.217        0.169        0.040    0.076
#> lwbd_b -0.046     -0.094       -0.084       -0.138       -0.080   -0.072
#> p_norm  0.071      0.945        0.569        0.807        0.616    0.612
#> p_quan  0.080      0.920        0.600        0.600        0.680    0.480
#>        y1.sports  y1.j1
#> est        0.147  0.017
#> mean       0.164  0.083
#> sd         0.084  0.099
#> upbd       0.328  0.277
#> lwbd      -0.001 -0.112
#> upbd_q     0.356  0.297
#> lwbd_q     0.057 -0.081
#> upbd_b     0.394  0.342
#> lwbd_b     0.017 -0.099
#> p_norm     0.051  0.405
#> p_quan     0.000  0.360

Similarly, the default approach is using a generalized linear model to fit the final full model. If the argument nonlinear=TRUE, multiple Additive Regression Trees (MART) will be used to fit the model in estimating the outcome. Applications for nonlinear Mediation Analysis are described in Yu et al. (2017, 2018).

 bootmed.b.b.2<-boot.med(data=data.b.b.2,n=2,n2=40,nu=0.05,nonlinear=TRUE)

 summary(bootmed.b.b.2)
#> MMA Analysis: Estimated Mediation Effects Using MART
#> For Predictor/Moderator at pred 
#> $total.effect
#>    est   mean     sd   upbd   lwbd upbd_q lwbd_q upbd_b lwbd_b p_norm p_quan 
#>  0.067  0.271  0.034  0.337  0.205  0.313  0.237  0.316  0.235  0.000  0.000 
#> 
#> $direct.effect
#>    est   mean     sd   upbd   lwbd upbd_q lwbd_q upbd_b lwbd_b p_norm p_quan 
#> -0.014  0.078  0.087  0.249 -0.093  0.192  0.004  0.203  0.001  0.370  0.000 
#> 
#> $indirect.effect
#>        y1.all y1.tvhours y1.cmpthours y1.cellhours y1.exercises y1.sweat
#> est     0.081      0.004        0.001       -0.013        0.000    0.000
#> mean    0.193      0.004        0.049        0.012       -0.006    0.006
#> sd      0.108      0.011        0.059        0.045        0.035    0.016
#> upbd    0.405      0.025        0.165        0.101        0.061    0.037
#> lwbd   -0.020     -0.016       -0.066       -0.076       -0.074   -0.025
#> upbd_q  0.266      0.015        0.128        0.059        0.038    0.027
#> lwbd_q  0.047     -0.008        0.007       -0.042       -0.033   -0.003
#> upbd_b  0.268      0.016        0.041        0.062        0.042    0.030
#> lwbd_b  0.033      0.004        0.007       -0.046       -0.029   -0.002
#> p_norm  0.076      0.678        0.405        0.782        0.852    0.694
#> p_quan  0.000      0.500        0.000        0.500        0.500    1.000
#>        y1.sports  y1.j1
#> est        0.070  0.011
#> mean       0.165  0.067
#> sd         0.097  0.050
#> upbd       0.354  0.164
#> lwbd      -0.025 -0.030
#> upbd_q     0.267  0.120
#> lwbd_q     0.049  0.016
#> upbd_b     0.192  0.122
#> lwbd_b     0.041  0.015
#> p_norm     0.089  0.176
#> p_quan     0.000  0.000

Plots of the fitted mma object from boot.med

plot.mma() plots the marginal effect of the selected variable on the outcome, and the marginal effect of the predictor on the selected variable. Interpretations for similar plots can be found in Yu et al. (2017).

 plot(bootmed.b.b.1,vari="exercises",xlim=c(0,50))

 plot(bootmed.b.b.1,vari="sports")

Combined function for multiple TVE analysis

The mma function is a combined function that automatically identify potential MCs, based on which to make statistical inference on the mediation effects.

 x=weight_behavior[,c(2,4:14)]
 pred=weight_behavior[,3]
 y=weight_behavior[,15]
 mma.b.b.glm<-mma(x,y,pred=pred,contmed=c(7:9,11:12),binmed=c(6,10),binref=c(1,1), catmed=5,catref=1,predref="M",alpha=0.4,alpha2=0.4,n=2,n2=2)
#> [1] 1
#> [1] 2
 summary(mma.b.b.glm)
#> MMA Analysis: Estimated Mediation Effects Using GLM
#> For Predictor/Moderator at pred 
#> $total.effect
#>    est   mean     sd   upbd   lwbd upbd_q lwbd_q upbd_b lwbd_b p_norm p_quan 
#>  0.498  0.840  0.313  1.453  0.227  1.050  0.630  0.619  0.619  0.007  0.000 
#> 
#> $direct.effect
#>    est   mean     sd   upbd   lwbd upbd_q lwbd_q upbd_b lwbd_b p_norm p_quan 
#>  0.429  0.619  0.258  1.125  0.113  0.792  0.446  0.437  0.437  0.016  0.000 
#> 
#> $indirect.effect
#>        y1.all y1.exercises y1.sweat y1.sports
#> est     0.069       -0.021    0.014     0.137
#> mean    0.221       -0.022   -0.023     0.170
#> sd      0.055        0.020    0.037     0.011
#> upbd    0.328        0.018    0.049     0.193
#> lwbd    0.113       -0.063   -0.095     0.148
#> upbd_q  0.258       -0.009    0.001     0.178
#> lwbd_q  0.184       -0.036   -0.048     0.163
#> upbd_b  0.182        0.182    0.182     0.182
#> lwbd_b -0.008       -0.008   -0.008    -0.008
#> p_norm  0.000        0.272    0.527     0.000
#> p_quan  0.000        0.000    1.000     0.000

When the argument nonlinear=TRUE, multiple Additive Regression Trees (MART) will be used to fit the final full model in estimating the outcome.

 mma.b.b.mart<-mma(x,y,pred=pred,contmed=c(7:9,11:12),binmed=c(6,10),binref=c(1,1), catmed=5,catref=1,predref="M",alpha=0.4,alpha2=0.4,nonlinear=TRUE,n=2,n2=5)
 summary(mma.b.b.mart)

 summary(mma.b.b.mart)
#> MMA Analysis: Estimated Mediation Effects Using MART
#> For Predictor/Moderator at pred 
#> $total.effect
#>    est   mean     sd   upbd   lwbd upbd_q lwbd_q upbd_b lwbd_b p_norm p_quan 
#>  0.207  0.334  0.168  0.663  0.005  0.581  0.174  0.346  0.167  0.047  0.000 
#> 
#> $direct.effect
#>    est   mean     sd   upbd   lwbd upbd_q lwbd_q upbd_b lwbd_b p_norm p_quan 
#>  0.078  0.213  0.212  0.629 -0.203  0.522  0.007  0.213 -0.002  0.316  0.400 
#> 
#> $indirect.effect
#>        y1.all y1.exercises y1.sweat y1.sports
#> est     0.128        0.003    0.004     0.090
#> mean    0.121        0.014    0.017     0.082
#> sd      0.047        0.011    0.024     0.025
#> upbd    0.212        0.037    0.064     0.131
#> lwbd    0.030       -0.008   -0.031     0.033
#> upbd_q  0.167        0.032    0.054     0.110
#> lwbd_q  0.056        0.005   -0.002     0.047
#> upbd_b  0.169        0.034    0.058     0.111
#> lwbd_b  0.100        0.008   -0.003     0.077
#> p_norm  0.009        0.207    0.492     0.001
#> p_quan  0.000        0.000    0.400     0.000

plot(mma.b.b.mart,vari="exercises")

#> Error in pred_name[, z.b]: incorrect number of dimensions

plot(mma.b.b.glm,vari="sweat")

Examples for Third-Variable Effect Analysis

Qingzhao Yu & Lin Zhu

2022-05-14

Package installation

Data Organization and Identify Potential Mediators/Confounders

One binary/continuous predictor and one binary/continuous outcome

Survival outcome

Multivariate predictors

Multivariate outcomes

Binary predictor and binary outcome

Survival outcome

Statistical Inference on TVE Analysis

Binary predictor and binary outcome

Plots of the fitted mma object from boot.med

Combined function for multiple TVE analysis

References