Describing Selection Bias from Dependent Censoring

John W. Jackson

2019-09-19

INTRODUCTION

This software implements three diagnostics for confounding/selection-bias that can be used in sequence. Built upon the framework of sequential exchangeability, these apply to any study of multivariate exposures e.g. time-varying exposures, direct effects, interaction, and censoring. The first two diagnostics pertain to the nature of confounding/selection-bias in the data, while the third is meant to examine residual confounding/selection-bias after applying certain adjustment methods. These tools are meant to help describe confounding/selection-bias in complex data after investigators have selected covariates to adjust for (e.g., through subject-matter knowledge).

CAPABILITIES

The tools can accommodate:

USAGE

WORKFLOW

The software is designed to use data that analysts may readily have on hand in analyses of time-varying exposures. This includes time-varying exposures, time-varying covariates, time-varying weights (when applicable), and time-varying propensity-score strata (when applicable). Time-varying exposure history strata may also be required, and we provide functions to create these from the exposure variables.

The analysis generally proceeds as follows (see below for examples):

  1. Format the data. If needed, use Widen() to transform person-time indexed dataframe into person indexed dataframe. If needed, use makehistory.one() or makehistory.two() to create variables that describe exposure history over time.

  2. Create a tidy dataframe. Use lengthen() Restructure the wide dataframe into a tidy dataframe indexed by person id, exposure measurement time, and covariate measurement time. If desired, use omit.history() to omit irrelevant covariate history (that will not be used by the analyst to adjust for confounding at time \(t\))

  3. Create a covariate balance table. Take the tidy dataframe output from lenthen() and use balance() to create a covariate balance table.

  4. Create a trellised covariate balance plot using makeplot().

Let’s go ahead and load the confoundr package. We’ll also load the dplyr package to hep present the results.

library(dplyr)
library(confoundr)

DEMONSTRATION OF DIAGNOSTIC 1

Suppose we wished to evaluate the comparative effectiveness of five medications on reducing positive and negative symptoms scale in patients with schizophrenia. Our simulated data are loosely based on a large comparative effectiveness trial, the CATIE Study, available from the National Institutes of Mental Health. Like many trials in this therapeutic area, it is plagued with high rates of dropout, and decisions to leave the study are sometimes informative of worsening symptom response. Such dropout can bias estimates of comparative effectiveness, even when dropout rates are similar across arms.

Inverse Probability of Censoring Weights are often used to correct such dropout under certain assumptions about the dropout process. For example, one might assume that the data censored by dropout are Missing at Random given most recent measures of covariates and the treatment arm. If the weights are estimated well, the weighted population will exhibit no association between measures of symptom response and later dropout over time. We will use the confoundr diagnostics to first describe these associations in the raw data, before weighting, and then after inverse probability weights have been applied.

Let’s load the data from our simulated trial:

data("catie_sim")

In this simulated trial, we have five treatment arms, with measures of symptom response, patient-reported outcomes, and history of treatment change, measured monthly or quarterly over 18 months (i.e., 0,1,3,6,9,12,15,18). We assume that, at each follow-up visit, study dropout is missing at random given the most recent measures, the treatment arm, and baseline covariates among those who have remained in the study. Here is a description of some baseline covariates:

catie_sim %>% filter(time==0) %>% select(starts_with("age.grp."),white,black,other,exacer,unemployed,starts_with("site")) %>% summarize_all("mean")
#> # A tibble: 1 x 14
#>   age.grp.1824 age.grp.2534 age.grp.3544 age.grp.4554 age.grp.5567 white
#>          <dbl>        <dbl>        <dbl>        <dbl>        <dbl> <dbl>
#> 1        0.594        0.347       0.0594            0            0 0.623
#> # ... with 8 more variables: black <dbl>, other <dbl>, exacer <dbl>,
#> #   unemployed <dbl>, site.ro <dbl>, site.sh <dbl>, site.uc <dbl>,
#> #   site.va <dbl>

(Most of the names are self-explanatory, but exacer refers to a recent hospitalization, and site is the type of clinical trial study site).

Here is a description of time-varying covariates:

catie_sim %>% group_by(time) %>% select(Chg.pansstotal,cs14,cs16,calg1,epsmean,qoltot,Chg.pansstotal,pct.gain,phase.change.cum,studydisc) %>% summarize_all("mean")
#> Adding missing grouping variables: `time`
#> # A tibble: 8 x 10
#>    time Chg.pansstotal  cs14  cs16 calg1 epsmean qoltot pct.gain
#>   <int>          <dbl> <dbl> <dbl> <dbl>   <dbl>  <dbl>    <dbl>
#> 1     0          0      1.30  3.84  1.81   0.207   2.74   0     
#> 2     1         -0.502  1.31  3.82  1.80   0.201   2.75   0.0150
#> 3     3         -1.53   1.31  3.76  1.77   0.196   2.80   0.0481
#> 4     6         -3.02   1.29  3.69  1.73   0.190   2.86   0.0934
#> 5     9         -4.50   1.28  3.61  1.69   0.184   2.90   0.136 
#> 6    12         -5.99   1.28  3.53  1.65   0.177   2.95   0.186 
#> 7    15         -7.46   1.26  3.46  1.61   0.170   3.00   0.238 
#> 8    18         -8.90   1.26  3.38  1.58   0.164   3.03   0.278 
#> # ... with 2 more variables: phase.change.cum <dbl>, studydisc <dbl>

Before we begin, a few clarifying points about this example may be helpful. Our use of the diagnostics in this example is akin to assessing covariate balance across levels of study dropout. That is, we are treating study dropout as an exposure. The sticky point here is that we define study dropout as equal to 1 at the last visit and 0 at all times before. This means that we are actually evaluating censoring due to study dropout at time \(t+1\), indexed at time \(t\), as is done in most applied studies that use inverse probability of censoring weights.

Now let us proceed with our first step, to use Diagnostic 1 to describe the selection-bias in the data, e.g., how the covariates are not missing completely at random over time (that is, showing that censoring due to dropout depends at least on measured covariates).

STEP 1: FORMAT THE DATAFRAME

Our data formats should obey the following conventions:

  • All covariates should be numeric
  • Underscores should not appear in any variable root name.
  • Exposures
    • A common referent value–to which balance metrics will compare all other levels–should be coded as the lowest value
    • In our examples, we have two exposures. The first is treatment assignment, treat.grp. It is coded numerically as one variable, with the numbers indicating treatment arm at baseline. The second exposure is censoring due to study dropout, studydisc, which equals \(1\) at a participant’s last visit when they exit the study prematurely, and \(0\) otherwise.
  • Covariates
    • All categorical covariates should be broken up into numeric indicator variables.

For example, here is a snapshot of our example data (e.g., the simulated person CATIEID=1440 who dropped out after completing three study visits:

print(catie_sim %>% select(CATIEID,time,treat.grp,studydisc,pct.gain) %>% filter(CATIEID==1440))
#> # A tibble: 3 x 5
#>   CATIEID  time treat.grp studydisc pct.gain
#>     <int> <int>     <int>     <int>    <dbl>
#> 1    1440     0         1         0   0     
#> 2    1440     1         1         0   0.0194
#> 3    1440     3         1         1   0.0581

Our example data is currently indexed by person-id and time (i.e., it’s in “long” format). We need to convert our data into a version that is indexed by person-id (i.e., “wide” format). We use the widen() function to accomplish this.

catie_sim.w <- widen(
 input=catie_sim,
 id="CATIEID",
 time="time",
 exposure="studydisc",
 covariate=c("treat.grp","phase.change.cum",
             "Chg.pansstotal","cs14",
             "cs16","pct.gain",
             "epsmean","qoltot",
             "exacer","Bpansstotal",
             "unemployed"),
 weight.exposure="wtr"             
)
#> Adding missing grouping variables: `ID`
glimpse(catie_sim.w %>% select(CATIEID,starts_with("treat.grp"),starts_with("studydisc"),starts_with("pct.gain")) %>% filter(CATIEID==1440))
#> Observations: 1
#> Variables: 25
#> $ CATIEID      <int> 1440
#> $ treat.grp_0  <dbl> 1
#> $ treat.grp_1  <dbl> 1
#> $ treat.grp_12 <dbl> NA
#> $ treat.grp_15 <dbl> NA
#> $ treat.grp_18 <dbl> NA
#> $ treat.grp_3  <dbl> 1
#> $ treat.grp_6  <dbl> NA
#> $ treat.grp_9  <dbl> NA
#> $ studydisc_0  <dbl> 0
#> $ studydisc_1  <dbl> 0
#> $ studydisc_12 <dbl> NA
#> $ studydisc_15 <dbl> NA
#> $ studydisc_18 <dbl> NA
#> $ studydisc_3  <dbl> 1
#> $ studydisc_6  <dbl> NA
#> $ studydisc_9  <dbl> NA
#> $ pct.gain_0   <dbl> 0
#> $ pct.gain_1   <dbl> 0.01935687
#> $ pct.gain_12  <dbl> NA
#> $ pct.gain_15  <dbl> NA
#> $ pct.gain_18  <dbl> NA
#> $ pct.gain_3   <dbl> 0.0580706
#> $ pct.gain_6   <dbl> NA
#> $ pct.gain_9   <dbl> NA

In this wide dataset, each variable measurement appears as its own variable, indexed by time (e.g., var_0, var_1, var_3, var_6, ...), including static covariates that are measured at baseline and do not change over time. Generally, any scheme can be used to index time (e.g., var_0, var_1, var_2, ...) and it could be arbitrarily different for each covariate, supporting heterogeneous covariate measurement schedules as is common in clinical trial and cohort studies.

There is one more housekeeping task. If we wish to obtain balance diagnostics that adjust for prior exposure history, or for a categorical baseline variable, we can use the makehistory.one() or (makehistory.two()) function to create these measures in the “wide” dataset. This will create a variable that describes exposure history through time \(t-1\), indexed at time \(t\), possibly stratified by a baseline covariate measurement. In our example, we want to first compare covariate balance across treatment arms. Because this exposure is static, it needs no history variable. We also want to compare at each time \(t\) the censored vs. the uncensored, among those in the same treatment arm who have remained uncensored before time \(t\). Thus, we need a history variable for censoring that tells us which treatment arm they belong to, and summarizes their censoring history.

We can create the appropriate history strata for our example by using makehistory.two(), with treatment arm as the first exposure and studydropout as the second:

catie_sim.wh <- makehistory.two(
 input=catie_sim.w,
 id="CATIEID",
 exposure.a="treat.grp",
 exposure.b="studydisc",
 times=c(0,1,3,6,9,12,15,18),
 name.history.a="ht",
 name.history.b="hs"
)
glimpse(catie_sim.wh %>% select(CATIEID,starts_with("pct.gain"),starts_with("treat.grp"),starts_with("studydisc"),starts_with("ht"),starts_with("hs")) %>% filter(CATIEID==1440))
#> Observations: 1
#> Variables: 41
#> $ CATIEID      <int> 1440
#> $ pct.gain_0   <dbl> 0
#> $ pct.gain_1   <dbl> 0.01935687
#> $ pct.gain_12  <dbl> NA
#> $ pct.gain_15  <dbl> NA
#> $ pct.gain_18  <dbl> NA
#> $ pct.gain_3   <dbl> 0.0580706
#> $ pct.gain_6   <dbl> NA
#> $ pct.gain_9   <dbl> NA
#> $ treat.grp_0  <dbl> 1
#> $ treat.grp_1  <dbl> 1
#> $ treat.grp_12 <dbl> NA
#> $ treat.grp_15 <dbl> NA
#> $ treat.grp_18 <dbl> NA
#> $ treat.grp_3  <dbl> 1
#> $ treat.grp_6  <dbl> NA
#> $ treat.grp_9  <dbl> NA
#> $ studydisc_0  <dbl> 0
#> $ studydisc_1  <dbl> 0
#> $ studydisc_12 <dbl> NA
#> $ studydisc_15 <dbl> NA
#> $ studydisc_18 <dbl> NA
#> $ studydisc_3  <dbl> 1
#> $ studydisc_6  <dbl> NA
#> $ studydisc_9  <dbl> NA
#> $ ht_0         <chr> "H"
#> $ ht_1         <chr> "H10"
#> $ ht_12        <chr> "H101011NANANANA"
#> $ ht_15        <chr> "H101011NANANANANANA"
#> $ ht_18        <chr> "H101011NANANANANANANANA"
#> $ ht_3         <chr> "H1010"
#> $ ht_6         <chr> "H101011"
#> $ ht_9         <chr> "H101011NANA"
#> $ hs_0         <chr> "H1"
#> $ hs_1         <chr> "H101"
#> $ hs_12        <chr> "H101011NANANANANA"
#> $ hs_15        <chr> "H101011NANANANANANANA"
#> $ hs_18        <chr> "H101011NANANANANANANANANA"
#> $ hs_3         <chr> "H10101"
#> $ hs_6         <chr> "H101011NA"
#> $ hs_9         <chr> "H101011NANANA"

In our example, we can ignore the treatment arm history variables “ht” that makehistory.two() created since treatment arm is fixed throughout the study. Note that the makehistory.one() and make.history.two() functions allow users to stratify history variables by a baseline covariate via the group argument. It should never be used for a true exposure, as doing so would inaccurately classify any persons who intermittently miss visits throughout the study.

STEP 2: CREATE A TIDY DATAFRAME

The next step is to use lengthen() to transform our wide dataframe into one that is jointly indexed by person-id, exposure measurement time, and covariate measurement time (i.e., tidy). This data structure is the secret sauce that allows for much of this software’s flexibility. Here, we have to specify which diagnostic we want because the tidy structure for diagnostic 2 (exposure-covariate feedback) differs from the one used for diagnostics 1 and 3 (time-varying confounding/selection-bias).

When creating this tidy dataset, lengthen() automatically deletes observations that have missing measurements for exposure or a specific covariate at a given time. Thus, it carries out a form of complete case assessment by default. There is a mandatory censoring argument that we have to supply to let lengthen() know whether there are additional observations to remove according to an artificial censoring rule. Since we are evaluating balance for censoring due to study dropout, and all data after study dropout are missing, we don’t need any artificial censoring rules to remove data after study dropout, so we set the censoring argument to "no". However, if we were evaluating balance for censoring due to treatment protocol deviation, we would need to set censoring to "yes" and use the censor argument to specify the appropriate indicator variable (0=drop, 1=keep).

catie_sim.whl1 <- lengthen(
  input=catie_sim.wh,
  id="CATIEID",
  diagnostic=1,
  censoring="no",
  exposure="studydisc",
  temporal.covariate=c("phase.change.cum","Chg.pansstotal",
                       "cs14","cs16",
                       "pct.gain","epsmean",
                       "qoltot"),
  times.exposure=c(0,1,3,6,9,12,15,18), 
  times.covariate=c(0,1,3,6,9,12,15,18),
  history="hs"
)
#> Warning in lengthen(input = catie_sim.wh, id = "CATIEID", diagnostic =
#> 1, : The exposure and/or some covariates contain missing data. Subsequent
#> calculations are not guaranteed to be unbiased in the presence of partially
#> missing data.
glimpse(catie_sim.whl1 %>% filter(CATIEID==1440 & name.cov=="pct.gain"))
#> Observations: 6
#> Variables: 7
#> $ CATIEID        <chr> "1440", "1440", "1440", "1440", "1440", "1440"
#> $ name.cov       <chr> "pct.gain", "pct.gain", "pct.gain", "pct.gain",...
#> $ time.exposure  <dbl> 0, 1, 1, 3, 3, 3
#> $ time.covariate <dbl> 0, 0, 1, 0, 1, 3
#> $ hs             <chr> "H1", "H101", "H101", "H10101", "H10101", "H10101"
#> $ studydisc      <dbl> 0, 0, 0, 1, 1, 1
#> $ value.cov      <dbl> 0.00000000, 0.00000000, 0.01935687, 0.00000000,...

You should notice that lengthen() has produced a tidy dataframe where, for diagnostic 1, exposure times are greater or equal to covariate times.

Omitting covariate history

We can use omit.history() to ensure that our balance evaluations accord with our assumption that, at each follow-up visit, study dropout is missing at random given the most recent measures of time-varying covariates, the treatment arm, and baseline covariates. The omit.history() function requires that users specify the covariates to be modified (via the covariate argument), and how they want to remove irrelevant covariate history (via the omission argument). There are three types of omissions that can be specified:

  • "fixed": sets covariate values to missing for certain measurement times \(t\) (specified in the times argument). For example, we might use this for covariate “l” at time 2 to set all data on “l” at time 2 to missing, regardless of the exposure measurement time.

  • "relative": sets covariate values to missing only for covariate measurement times at or before a certan distance \(k\) from the exposure measurement times (specified in distance argument). For example, we might use this to set data for covariate “l” with distance 2 to missing data only when it is measured two units or more before the exposure measurement time (i.e. \(t-2, t-3, t-4\) and so on).

  • "same.time": sets covariate values to missing only for covariate measurement times that align with exposure measurement times (no additional argument needed). This can be useful when, at any given time, an exposure precedes a particular covariate.

catie_sim.whlo1 <- omit.history(
  input=catie_sim.whl1,
  covariate.name=c("phase.change.cum","Chg.pansstotal",
                   "cs14","cs16","pct.gain",
                   "epsmean","qoltot"),
  omission="relative",
  distance="1"
)
glimpse(catie_sim.whlo1 %>% filter(CATIEID==1440 & name.cov=="pct.gain"))
#> Observations: 3
#> Variables: 7
#> $ CATIEID        <chr> "1440", "1440", "1440"
#> $ name.cov       <chr> "pct.gain", "pct.gain", "pct.gain"
#> $ time.exposure  <dbl> 0, 1, 3
#> $ time.covariate <dbl> 0, 1, 3
#> $ hs             <chr> "H1", "H101", "H10101"
#> $ studydisc      <dbl> 0, 0, 1
#> $ value.cov      <dbl> 0.00000000, 0.01935687, 0.05807060

Because omit.history can be applied repeatedly to the dataset output from lengthen(), users can encode different depths of covariate history for each covariate, if desired.

STEP 3: CREATE A COVARIATE BALANCE TABLE

Now we are ready to use balance() to create a balance table comparing the censored to the uncensored over time. As was the case with lengthen() we have to specify what we want to do. The requirements vary (and are detailed further down) but at minimum we need to specify the diagnostic of interest, what approach is used to adjust the diagnostic (if any), whether or not additional censoring rules are to be applied, and the measurement times of interest for exposures and covariates.

We also have to specify the scope through which we will examine person-time.

  • "all" provides balance metrics for each exposure-covariate time pairing in the tidy dataset.

  • "average" provides balance metrics that standardize over different dimensions of person-time. An additional argument, average.over is used to specify the dimension to be averaged over:

  • "strata" averages over strata within levels of exposure values, history, time, and distance

  • "values" averages over strata, non-referent exposure values within levels of history, time, and distance

  • "history" averages over strata, non-referent exposure values and history within levels of time and distance

  • "time" averages over strata, non-referent exposure values, history, and time within levels of distance

  • "distance" averages over non-referent exposure values, history, time and distance. If desired, once can use the additional argument periods to average over segmented pools of distance, as shown later on.

  • "recent" provides balance metrics only for a chosen distance \(k\) (specified with the additional argument ‘recency’). Specifying a scope="recent" with recency=0 is a quick way to assess balance for the most recent covariate measurements.

There are optional arguments that can help us cope with sparse data. For example, in our example, we can optionally specify sd.ref to calculate the standardized mean difference using the covariate standard deviation from the uncensored at time \(t\) given censoring history (the default uses the corresponding pooled estimate). If we were taking averages over person-time, we can use ignore.missing.metric to decide whether or not to report standardized mean differences when some dimensions of person-time lack variation in a covariate (or cases where everyone or no one is censored given history).

For our example, in line with our assumptions, let us examine balance for the most recent covariates:

baltbl1 <- balance(
  input=catie_sim.whlo1,
  diagnostic=1,
  approach="none",
  censoring="no",
  scope="recent",
  recency=0,
  exposure="studydisc",
  history="hs",
  times.exposure=c(0,1,3,6,9,12,15,18),
  times.covariate=c(0,1,3,6,9,12,15,18),
  sd.ref="yes"
)
#> Warning in balance(input = catie_sim.whlo1, diagnostic = 1, approach =
#> "none", : Some exposure times have no exposure variation within levels of
#> exposure history. Estimates for these times will not appear in the results
print(baltbl1 %>% filter(name.cov=="pct.gain" & time.exposure<=6) %>% mutate_at(c("D","SMD"),round,3))
#>    E        H name.cov time.exposure time.covariate      D    SMD   N Nexp
#> 1  1       H1 pct.gain             0              0  0.000  0.000 345   29
#> 2  1     H101 pct.gain             1              1 -0.015 -0.314 316   34
#> 3  1   H10101 pct.gain             3              3  0.016  0.113 282   36
#> 4  1 H1010101 pct.gain             6              6 -0.086 -0.293 246   21
#> 5  1       H2 pct.gain             0              0  0.000  0.000 347   27
#> 6  1     H202 pct.gain             1              1 -0.026 -0.496 320   37
#> 7  1   H20202 pct.gain             3              3  0.002  0.011 283   30
#> 8  1 H2020202 pct.gain             6              6  0.010  0.032 253   20
#> 9  1       H3 pct.gain             0              0  0.000  0.000 310   24
#> 10 1     H303 pct.gain             1              1  0.008  0.164 286   25
#> 11 1   H30303 pct.gain             3              3  0.042  0.273 261   24
#> 12 1 H3030303 pct.gain             6              6  0.303  1.168 237   16
#> 13 1       H4 pct.gain             0              0  0.000  0.000 183   11
#> 14 1     H404 pct.gain             1              1 -0.010 -0.186 172   12
#> 15 1   H40404 pct.gain             3              3 -0.051 -0.315 160   15
#> 16 1 H4040404 pct.gain             6              6 -0.021 -0.063 145   15
#> 17 1       H5 pct.gain             0              0  0.000  0.000 244   17
#> 18 1     H505 pct.gain             1              1 -0.006 -0.114 227   16
#> 19 1   H50505 pct.gain             3              3  0.028  0.182 211   26
#> 20 1 H5050505 pct.gain             6              6 -0.076 -0.245 185    9

In the table produced by balance(), E is the exposure value that the referent value is compared to, H is the exposure history. The exact format of the output table will depend on the arguments supplied to balance(). There are two reasons for this. The first reason is that diagnostics 1, 2, and 3 differ in when they require exposure history H and strata S (see below for more details), and these variables will only appear when they have been used to adjust the balance estimate. The second reason kicks in when using scope="average" and average.over="..." to request balance estimates that average over some dimension of persion-time.

STEP 4: CREATE A COVARIATE BALANCE PLOT

Our final step is to use makeplot() to create a trellised covariate balance plot. The required syntax is detailed below, but it is typically short. All that is required is to specify diagnostic, approach, censoring, and scope (and if need be, average.over) which should all follow the parameters specified in balance(). These are sufficient to tell makeplot() what kind of plot to produce, regardless of the underlying data. There are a host of optional formatting arguments that you can use to produce publication-quality plots. See ?makeplot for details.

balplot1 <-makeplot(
  input=baltbl1,
  diagnostic=1,
  approach="none",
  censoring="no",
  scope="recent",
  label.exposure="Study Dropout",
  label.covariate="Covariate"
)
balplot1

This plot shows the standardized mean difference (x-axis) for each covariate (y-axis) at different pairings of dropout and covariate measurement times (panels), among the uncensored in each treatment arm (dots). This plot portrays our decision to restrict our pairing of times to those where covariates are measured just prior to dropout. We see here that several covariates are strongly associated with dropout over time, with standarized mean differences much larger than 0.25 (the chosen reference lines).

The plot produced by makeplot() will vary according to the parameters, which should share those used with balance(). For example, if we had used scope="all" with balance() and makeplot(), this would have arrayed the dropout and measurment parings across a two-dimensional grid, where panel columns correspond to covariate measurement times and panel rows correspond to dropout times. Such displays are useful for distinguishing balance of “static” covariates that do not vary over time from “temporal” covariates that do (as shown below). Sometimes, it is not possible to estimate certain balance metrics due to lack of variation, as would happen if everyone dropped out or shared the same covariate value. When this happens, makeplot() will warn you about how many balance estimates are missing from the plot. You can identify these data elements in the balance table where their balance estimates will have missing values.

We can use optional arguments in makeplot() to produce a publication-quality plot. See ?makeplot for details. Once we are satisfied with the format, we can use ggsave() function from the ggplot2() package to save the plot as a PDF with a desired height and width. For example:

#ggsave(balplot1,"balplot1.pdf",height=7.5,width=10,units="in")

DEMONSTRATION OF DIAGNOSTIC 3

We have seen with diagnostic 1 that there is dependent-censoring over time. That is, study dropout is associated with measured covariates. Suppose we wish to remove dependent-censoring by using inverse probability of censoring weights to balance these covariates across levels of study dropout. We can use diagnostic 3 to examine how well our estimated weights accomplish this task. For demonstration purposes, we are going to pursue a different plot that incorporates balance across static covariates measured at baseline, using a limited selection of times to keep the plot readable.

To start, we return to our widened dataset and transform it to a tidy dataset under diagnostic 3. The call differs from before only in that we have added two extra arguments, static.covariate and weight.exposure. (If we had specified these for diagnostic 1, we wouldn’t need to reproduce a tidy dataset, because it will share the same structure for diagnostics 1 and 3).

catie_sim.whl3 <- lengthen(
  input=catie_sim.wh,
  id="CATIEID",
  diagnostic=3,
  censoring="no",
  exposure="studydisc",
  temporal.covariate=c("phase.change.cum","Chg.pansstotal",
                       "cs14","cs16",
                       "pct.gain","epsmean",
                       "qoltot"),
  static.covariate=c("exacer","Bpansstotal",
                     "unemployed"),
  times.exposure=c(0,1,3,6), 
  times.covariate=c(0,1,3,6),
  history="hs",
  weight.exposure="wtr"
)
#> Warning in lengthen(input = catie_sim.wh, id = "CATIEID", diagnostic =
#> 3, : The exposure and/or some covariates contain missing data. Subsequent
#> calculations are not guaranteed to be unbiased in the presence of partially
#> missing data.
glimpse(catie_sim.whl3 %>% filter(CATIEID==1440 & name.cov=="pct.gain"))
#> Observations: 6
#> Variables: 8
#> $ CATIEID        <chr> "1440", "1440", "1440", "1440", "1440", "1440"
#> $ name.cov       <chr> "pct.gain", "pct.gain", "pct.gain", "pct.gain",...
#> $ time.exposure  <dbl> 0, 1, 1, 3, 3, 3
#> $ time.covariate <dbl> 0, 0, 1, 0, 1, 3
#> $ hs             <chr> "H1", "H101", "H101", "H10101", "H10101", "H10101"
#> $ studydisc      <dbl> 0, 0, 0, 1, 1, 1
#> $ value.cov      <dbl> 0.00000000, 0.00000000, 0.01935687, 0.00000000,...
#> $ wtr            <dbl> 1.0039199, 0.9107662, 0.9107662, 1.2994836, 1.2...

In this new tidy dataset, the weights appear as an extra column, and the static covariates appear as extra rows, but only at the baseline covariate measurement time. The next step is to use omit.history() as we did before to encode our assumption about dependent censoring through the most recently measured temporal covariates.

catie_sim.whlo3 <- omit.history(
  input=catie_sim.whl3,
  covariate.name=c("phase.change.cum","Chg.pansstotal","cs14","cs16","pct.gain","epsmean","qoltot"),
  omission="relative",
  distance=1
)
glimpse(catie_sim.whlo3 %>% filter(CATIEID==1440 & name.cov=="pct.gain"))
#> Observations: 3
#> Variables: 8
#> $ CATIEID        <chr> "1440", "1440", "1440"
#> $ name.cov       <chr> "pct.gain", "pct.gain", "pct.gain"
#> $ time.exposure  <dbl> 0, 1, 3
#> $ time.covariate <dbl> 0, 1, 3
#> $ hs             <chr> "H1", "H101", "H10101"
#> $ studydisc      <dbl> 0, 0, 1
#> $ value.cov      <dbl> 0.00000000, 0.01935687, 0.05807060
#> $ wtr            <dbl> 1.0039199, 0.9107662, 1.2994836

The next step is to use balance() as we did before, but this time with scope="all" under the restricted set of measurement times, approach="weight", and an additional argument weight.exposure to specify the inverse probabilty of censoring weights. Again, in our example we are treating study dropout as the exposure and our data are already implicitly censored by study dropout, which is why we specified censoring="no". We have no other censoring mechanisms to incorporate via the censor argument. For the same reason, we have no other inverse probability of censoring weights to incorporate (e.g., due to protocol violation) via the weight.censor argument. In this example, all that is needed is to specify the weight.exposure argument because we are treating study dropout as an exposure and there are no further censoring mechanisms to account for.

baltbl3 <- balance(
  input=catie_sim.whlo3,
  diagnostic=3,
  approach="weight",
  censoring="no",
  scope="all",
  exposure="studydisc",
  history="hs",
  times.exposure=c(0,1,3,6),
  times.covariate=c(0,1,3,6),
  sd.ref="yes",
  weight.exposure="wtr"
)
print(baltbl3 %>% filter(name.cov=="pct.gain" & time.exposure<=6) %>% mutate_at(c("D","SMD","N","Nexp"),round,3))
#>    E        H name.cov time.exposure time.covariate      D    SMD       N
#> 1  1       H1 pct.gain             0              0  0.000  0.000 345.893
#> 2  1     H101 pct.gain             1              1 -0.011 -0.222 312.328
#> 3  1   H10101 pct.gain             3              3  0.037  0.258 278.774
#> 4  1 H1010101 pct.gain             6              6 -0.022 -0.074 243.930
#> 5  1       H2 pct.gain             0              0  0.000  0.000 345.862
#> 6  1     H202 pct.gain             1              1 -0.024 -0.448 318.047
#> 7  1   H20202 pct.gain             3              3  0.016  0.102 283.362
#> 8  1 H2020202 pct.gain             6              6  0.075  0.236 252.160
#> 9  1       H3 pct.gain             0              0  0.000  0.000 309.433
#> 10 1     H303 pct.gain             1              1  0.009  0.172 282.142
#> 11 1   H30303 pct.gain             3              3  0.017  0.113 259.506
#> 12 1 H3030303 pct.gain             6              6  0.141  0.544 233.472
#> 13 1       H4 pct.gain             0              0  0.000  0.000 185.533
#> 14 1     H404 pct.gain             1              1 -0.008 -0.148 174.135
#> 15 1   H40404 pct.gain             3              3 -0.043 -0.262 156.578
#> 16 1 H4040404 pct.gain             6              6  0.002  0.006 139.143
#> 17 1       H5 pct.gain             0              0  0.000  0.000 238.766
#> 18 1     H505 pct.gain             1              1  0.000 -0.003 225.239
#> 19 1   H50505 pct.gain             3              3  0.028  0.179 205.156
#> 20 1 H5050505 pct.gain             6              6 -0.040 -0.131 183.302
#>      Nexp
#> 1  30.706
#> 2  32.081
#> 3  34.111
#> 4  20.997
#> 5  25.452
#> 6  34.888
#> 7  29.488
#> 8  18.298
#> 9  23.701
#> 10 22.241
#> 11 23.096
#> 12 13.422
#> 13 13.788
#> 14 13.264
#> 15 11.930
#> 16 12.845
#> 17 12.713
#> 18 15.914
#> 19 21.319
#> 20  8.403

We are now ready to produce our balance plot via makeplot():

balplot3 <- makeplot(
  input=baltbl3,
  diagnostic=3,
  approach="weight",
  censoring="no",
  scope="all",
  label.exposure="Study Dropout",
  label.covariate="Covariate"
)
balplot3

We see in this plot that the weights have effectively reduced much of the imbalance for most covariates, at least through time 6. However, we also see that strong imbalances persist for other covariates at certain time points and treatment arms, a potential signal that our model for study dropout is not flexible enough, or that other issues, such as positivity violations are at play. Imbalances can also persist when inverse probability weights are truncated to certain thresholds. The plot is also informative in that we see persistent imbalances across baseline covariates as well. This is to be expected as these variables were used to stabilize the weights. This re-emphasizes the need to condition our final analyses on these baseline covariates.

A natural question is whether we can produce plots that adjust for baseline covariates. There are two options. If there are only a few such covariates and these can be combined into a master categorical variable, this variable can be wrapped into the history strata using either makehistory() function. Averaging over history with balance() will then adjust for this categorical variable. The alternative is to merge baseline covariates to the tidy dataset produced by lengthen(), and then use the broom package to produce balance estimates that adjust for multiple covariates, even continuous ones. If the output were reformatted appropriately, it could be plotted with makeplot() or with ggplot2 or lattice directly. This latter extension with modeling has not yet been automated into the software.

In all, assessments such as diagnostics 1 and 3 can be critical for describing the nature of time-varying confounding, and also selection-bias as we have demonstrated in this example. These assessments are certainly of use to investigators who can examine such issues and adapt their modeling approaches accordingly. Next, we turn to diagnostic 2, which tells us even more about the nature of time-varying confounding and/or selection-bias in complex longitudinal data.

DEMONSTRATION OF DIAGNOSTIC 2

In addition to describing selection-bias before and after applying inverse probability of censoring weights, we can also examine whether covariates are affected by exposure or an unmeasured cause of exposure. If it were present, conditioning on covariates during follow-up would not only potentially block treatment effects, conditioning would induce selection-bias, and so would not be a desirable approach for resolving the selection-bias. Treatment-confounder feedback motivates the use of inverse probability of censoring weights, or a time-varying application of multiple imputation, to resolve the selection-bias.

To evaluate treatment-confounder feedback, we simply assess the balance of each covariate measurement across each prior exposure measurement, adjusting for confounders that precede the exposure. Thankfully, we can use the same functions (and thus workflow) as before to produce a plot that describes treatment-confounder feedback. Recall that, in our study design, treatment arm is held fixed throughout the study. As a result, the plot will be much simpler than it would be if we wanted to evaluate feedback for a treatment that varied over time. Moreover, because the treatment arm assignment is randomized, there is no need to adjust for baseline covariates via adjustment with inverse probability weights or propensity score strata. In more general settings, we would need to indicate the root name of variables for (i) treatment history and inverse probability of treatment weights or (ii) propensity-score strata to adjust for prior treatment and covariate history that could confound this assessment.

In the presence of study dropout, as in our example, assessments of treatment-confounder feedback can suffer selection-bias. To address this, we include the root name of studydisc and its corresponding inverse probability of censoring weights wtr after having lagged these variables to align with covariate measurements at time \(t\) rather than time \(t+1\) (as was the case, implicitly, in this example).

catie.alt <- catie_sim %>%
  group_by(CATIEID) %>%
  arrange(CATIEID,time) %>% 
  mutate(wtr.lag=ifelse(.data$time==0,1,lag(.data$wtr)),
         studydisc.lag=ifelse(.data$time==0,0,lag(.data$studydisc)))
glimpse(catie.alt %>% select(CATIEID,time,studydisc,studydisc.lag,wtr,wtr.lag,pct.gain) %>% filter(CATIEID==1440))  
#> Observations: 3
#> Variables: 7
#> Groups: CATIEID [1]
#> $ CATIEID       <int> 1440, 1440, 1440
#> $ time          <int> 0, 1, 3
#> $ studydisc     <int> 0, 0, 1
#> $ studydisc.lag <dbl> 0, 0, 0
#> $ wtr           <dbl> 1.0039199, 0.9107662, 1.2994836
#> $ wtr.lag       <dbl> 1.0000000, 1.0039199, 0.9107662
#> $ pct.gain      <dbl> 0.00000000, 0.01935687, 0.05807060

In our example, studydisc.lag becomes a constant of 0. However, lengthen() requires a variable describing censoring indicators whenever weight.censor() is used, so we include it.

catie.alt.w <- widen(
 input=catie.alt,
 id="CATIEID",
 time="time",
 exposure="treat.grp",
 covariate=c("phase.change.cum","Chg.pansstotal",
             "cs14","cs16",
             "pct.gain","epsmean",
             "qoltot",
             "exacer","Bpansstotal","unemployed"),
 censor="studydisc.lag",
 weight.censor="wtr.lag"             
)
#> Adding missing grouping variables: `ID`
glimpse(catie.alt.w %>% select(CATIEID,starts_with("treat.grp"),starts_with("studydisc"),starts_with("pct.gain")) %>% filter(CATIEID==1440))
#> Observations: 1
#> Variables: 25
#> $ CATIEID          <int> 1440
#> $ treat.grp_0      <dbl> 1
#> $ treat.grp_1      <dbl> 1
#> $ treat.grp_12     <dbl> NA
#> $ treat.grp_15     <dbl> NA
#> $ treat.grp_18     <dbl> NA
#> $ treat.grp_3      <dbl> 1
#> $ treat.grp_6      <dbl> NA
#> $ treat.grp_9      <dbl> NA
#> $ studydisc.lag_0  <dbl> 0
#> $ studydisc.lag_1  <dbl> 0
#> $ studydisc.lag_12 <dbl> NA
#> $ studydisc.lag_15 <dbl> NA
#> $ studydisc.lag_18 <dbl> NA
#> $ studydisc.lag_3  <dbl> 0
#> $ studydisc.lag_6  <dbl> NA
#> $ studydisc.lag_9  <dbl> NA
#> $ pct.gain_0       <dbl> 0
#> $ pct.gain_1       <dbl> 0.01935687
#> $ pct.gain_12      <dbl> NA
#> $ pct.gain_15      <dbl> NA
#> $ pct.gain_18      <dbl> NA
#> $ pct.gain_3       <dbl> 0.0580706
#> $ pct.gain_6       <dbl> NA
#> $ pct.gain_9       <dbl> NA

catie.alt.wl2 <- lengthen(
  input=catie.alt.w,
  id="CATIEID",
  diagnostic=2,
  censoring="yes",
  exposure="treat.grp",
  temporal.covariate=c("phase.change.cum","Chg.pansstotal",
                       "cs14","cs16",
                       "pct.gain","epsmean",
                       "qoltot"),
  times.exposure=0, 
  times.covariate=c(0,1,3,6,9,12,15,18),
  censor="studydisc.lag",
  weight.censor="wtr.lag"
)
#> Warning in lengthen(input = catie.alt.w, id = "CATIEID", diagnostic = 2, :
#> WARNING: For diagnostic 2, unless exposure is randomized, specify the root
#> names for (i) exposure history and exposure weights or (ii) strata
#> Warning in lengthen(input = catie.alt.w, id = "CATIEID", diagnostic =
#> 2, : The exposure and/or some covariates contain missing data. Subsequent
#> calculations are not guaranteed to be unbiased in the presence of partially
#> missing data.
glimpse(catie.alt.wl2 %>% filter(CATIEID==1440 & name.cov=="pct.gain"))
#> Observations: 2
#> Variables: 8
#> $ CATIEID        <chr> "1440", "1440"
#> $ name.cov       <chr> "pct.gain", "pct.gain"
#> $ time.exposure  <dbl> 0, 0
#> $ time.covariate <dbl> 1, 3
#> $ treat.grp      <dbl> 1, 1
#> $ value.cov      <dbl> 0.01935687, 0.05807060
#> $ censor         <dbl> 0, 0
#> $ wtr.lag        <dbl> 1.0039199, 0.9107662

lengthen() generally creates a tidy dataset as before, but now restricted to observtions where exposure times precede covariate times. Through our call, We have further restricted the baseline exposure time since it is fixed over time. Furthermore, when lengthen() incorporates weights via weight.censor, it aligns them with measurement times of covariates rather than exposures, in contrast to diagnostic 3. We can skip omit.history() with diagnostic 2 as it is not relevant for assessing treatment-confounder feedback. When exposures vary over time, the balance of all prior exposures needs to be assessed. With our example, we are only concerned with the treatment arm assignment at baseline.

With balance() and makeplot() we have an array of possibilities. Here, we specify scope="average" and average.over="distance" to demonstrate how balance metrics can be averaged over person-time, to concisely summarize information over all follow-up times:

baltbl2 <- balance(
  input=catie.alt.wl2,
  diagnostic=2,
  approach="none",
  censoring="yes",
  scope="average",
  average.over="distance",
  periods=list(0:3,6:9,12:18),
  exposure="treat.grp",
  times.exposure=0,
  times.covariate=c(0,1,3,6,9,12,15,18),
  weight.censor="wtr.lag",
  sort.order=c("Chg.pansstotal","cs16",
               "pct.gain","epsmean",
               "qoltot","phase.change.cum",
               "cs14")
)
print(baltbl2 %>% filter(name.cov=="pct.gain") %>% mutate_at(c("D","SMD"),round,3))
#>   period.id period.start period.end name.cov      D    SMD        N
#> 1         1            1          3 pct.gain -0.016 -0.166 4298.928
#> 2         2            6          9 pct.gain -0.059 -0.157 3444.263
#> 3         3           12         18 pct.gain -0.121 -0.168 4376.045

We also make use of the optional sort.order argument in balance() to order the covariates in the table, which will carry through to makeplot().

balplot2 <- makeplot(
  input=baltbl2,
  diagnostic=2,
  approach="none",
  censoring="yes",
  scope="average",
  average.over="distance",
  label.exposure="Treatment Group",
  label.covariate="Covariate"
)
balplot2

In this plot, each panel represents the average balance of a covariate at time \(t\) across levels of the treatment arm at baseline (at \(t=0\)). We only see one dot because, balance() first computed metrics comparing each non-referent value to a referent value, took a weighted average of these, and then further took weighted averages according to the spacing between treatment arm and covariate measurements. We see that at near and distant intervals, the change in pansstotal score, and the percentage of weight gain, appears to differ across treatment arms. Because this is a randomized study, we do not expect these differences to be attributable to confounding by baseline covariates. Moreover, we have reduced the contribution of selection-bias by incorporating inverse probability of censoring weights. Therefore, it appears that both change in total PANSS score and percentage of weight gain is involved in treatment-confounder feedback.

ITERATIVE PROCESSING FOR LARGE DATA

One drawback of using a tidy data structure to drive the functions is that, it may take substantial memory to process data from studies with a large number of persons, measurements, and/or covariates. To cope with this, we provide the diagnose() macro that takes the “wide” dataframe and iteratively calls lengthen() and balance() by looping over covariates. In the future we will extend this capability to allow for looping over measurement times as well. The output data from diagnose() can be read directly into makeplot().

Note that there is an important change in workflow when users wish to use the diagnose() function and also remove irrelevant covariate history from the balance table calculations and plot. Users who wish to do so will need to first use diagnose() with scope=all. Then omit.history() can be applied to the dataset output from diagnose(). From that point, if desired, apply.scope() can be used to restrict to a certain recency or obtain summaries over person-time. This workflow change allows users to iterate while ensuring that the metrics ignore covariate history the user deems irrelevant to confounding.

diagtbl3 <- diagnose(
  input=catie_sim.wh,
  id="CATIEID",
  diagnostic=3,
  approach="weight",
  censoring="no",
  scope="all",
  exposure="studydisc",
  history="hs",
  temporal.covariate=c("phase.change.cum",
                       "Chg.pansstotal","cs14",
                       "cs16","pct.gain",
                       "epsmean","qoltot"),
  times.exposure=c(0,1,3,6,9,12,15,18),
  times.covariate=c(0,1,3,6,9,12,15,18),
  weight.exposure="wtr"
)
#> Warning in lengthen(input = input, diagnostic = diagnostic, censoring =
#> censoring, : The exposure and/or some covariates contain missing data.
#> Subsequent calculations are not guaranteed to be unbiased in the presence
#> of partially missing data.
#> Warning in balance(input = output.lengthen, diagnostic = diagnostic,
#> approach = approach, : Some exposure times have no exposure variation
#> within levels of exposure history. Estimates for these times will not
#> appear in the results
#> Warning in balance(input = output.lengthen, diagnostic = diagnostic,
#> approach = approach, : SMD values have been set to missing where there is
#> no covariate variation within some level of time-exposure, time-covariate,
#> exposure history, and exposure value; in this case averages for SMD
#> estimates will also appear as missing

diagtbl3.o <- omit.history(
  input=diagtbl3,
  covariate.name=c("phase.change.cum","Chg.pansstotal",
                   "cs14","cs16",
                   "pct.gain","epsmean",
                   "qoltot"),
  omission="relative",
  distance=1
)

diagtbl3.oa <- apply.scope(
  input=diagtbl3,#input=diagtbl3.o,
  diagnostic=3,
  approach="weight",
  scope="recent",
  recency=0
)

diagplot3.oa <- makeplot(
  input=diagtbl3.oa,
  diagnostic=3,
  approach="weight",
  censoring="no",
  scope="recent",
  average.over="distance",
  label.exposure="Treatment Group",
  label.covariate="Covariate"
)
diagplot3.oa

ADAPTING DATA THAT IS NOT TIME-INDEXED

Sometimes data are not indexed by measurement times. For example, suppose our trial only had one point of follow-up. We might receive a dataset like the following:

catie_sim.x <- catie_sim %>% filter(time==0) %>% select(-time)
print(catie_sim.x %>% select(CATIEID,treat.grp,studydisc,pct.gain,lead.pansstotal) %>% filter(CATIEID==1440))
#> # A tibble: 1 x 5
#>   CATIEID treat.grp studydisc pct.gain lead.pansstotal
#>     <int>     <int>     <int>    <dbl>           <dbl>
#> 1    1440         1         0        0            54.6

Let’s say we want to evaluate covariate balance across treatment arms, and then covariate balance across levels of study dropout by visit 1 given treatment arm. All we need to do is add a constant time index to the data, and then we can proceed exactly as we did before. For example:

#add time index, widen, and add history
catie_sim.0 <- catie_sim.x %>% mutate(time=0)

catie_sim.0.w <- widen(
 input=catie_sim.0,
 id="CATIEID",
 time="time",
 exposure=c("studydisc","treat.grp"),
 covariate=c("cs14","cs16","weight",
             "pansstotal","epsmean",
             "qoltot",
             "exacer","unemployed",
             "white","black",
             "age.grp.1824","age.grp.2534")
) 
#> Adding missing grouping variables: `ID`

catie_sim.0.wh <- makehistory.two(
 input=catie_sim.0.w,
 id="CATIEID",
 exposure.a="treat.grp",
 exposure.b="studydisc",
 times=0,
 name.history.a="ht",
 name.history.b="hs"
)

#balance across treatment arm
diagtbl.0t <- diagnose(
  input=catie_sim.0.wh,
  id="CATIEID",
  diagnostic=1,
  approach="none",
  censoring="no",
  scope="all",
  exposure="treat.grp",
  history="ht",
  temporal.covariate=c("phase.change.cum","Chg.pansstotal",
                       "cs14","cs16","pct.gain",
                       "epsmean","qoltot"),
  static.covariate=c("exacer","Bpansstotal","unemployed",
                     "white","black",
                     "age.grp1824","age.grp.25344"),
  times.exposure=0,
  times.covariate=0
)
glimpse(diagtbl.0t)
#> Observations: 32
#> Variables: 9
#> $ E              <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,...
#> $ H              <chr> "H", "H", "H", "H", "H", "H", "H", "H", "H", "H...
#> $ name.cov       <fct> black, cs14, cs16, epsmean, exacer, qoltot, une...
#> $ time.exposure  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
#> $ time.covariate <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
#> $ D              <dbl> -0.036670426, -0.073590203, 0.031728733, 0.0031...
#> $ SMD            <dbl> -0.077143777, -0.139478048, 0.049935544, 0.0140...
#> $ N              <dbl> 692, 692, 692, 692, 692, 692, 692, 692, 655, 65...
#> $ Nexp           <dbl> 347, 347, 347, 347, 347, 347, 347, 347, 310, 31...

#balance across censoring given treatment arm
diagtbl.0s <- diagnose(
  input=catie_sim.0.wh,
  id="CATIEID",
  diagnostic=1,
  approach="none",
  censoring="no",
  scope="all",
  exposure="studydisc",
  history="hs",
  temporal.covariate=c("phase.change.cum","Chg.pansstotal",
                       "cs14","cs16","pct.gain",
                       "epsmean","qoltot"),
  static.covariate=c("exacer","Bpansstotal",
                     "unemployed","white","black",
                     "age.grp1824","age.grp.25344"),
  times.exposure=0,
  times.covariate=0
)
glimpse(diagtbl.0s)
#> Observations: 40
#> Variables: 9
#> $ E              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
#> $ H              <chr> "H1", "H1", "H1", "H1", "H1", "H1", "H1", "H1",...
#> $ name.cov       <fct> black, cs14, cs16, epsmean, exacer, qoltot, une...
#> $ time.exposure  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
#> $ time.covariate <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
#> $ D              <dbl> 0.0185508512, -0.0159209271, -0.0408091679, 0.0...
#> $ SMD            <dbl> 0.0384838936, -0.0280695906, -0.0670020166, 0.4...
#> $ N              <dbl> 345, 345, 345, 345, 345, 345, 345, 345, 347, 34...
#> $ Nexp           <dbl> 29, 29, 29, 29, 29, 29, 29, 29, 27, 27, 27, 27,...

You can use the exact same procedure to appropriately examine covariate balance in a study of multiple distinct exposures, such as interaction or mediation.

REQUIRED ARGUMENTS, BY FUNCTION

Required arguments for lengthen() by diagnostic, approach, and censoring arguments |

Diagnostic Approach Censoring Additional Args
1 “none” “no” id, exposure, …covariate, times…, history
1 “none” “yes” id, exposure, …covariate, times…, history, censor
2 “none” “no” id, exposure, …covariate, times…,
2 “none” “yes” id, exposure, …covariate, times…, history, censor
2 “weight” “no” id, exposure, …covariate, times…, history, weight.exposure
2 “weight” “yes” id, exposure, …covariate, times…, history, weight.exposure
2 “stratify” “no” id, exposure, …covariate, times…, strata
2 “stratify” “yes” id, exposure, …covariate, times…, strata, censor
3 “weight” “no” id, exposure, …covariate, times…, history, weight.exposure
3 “weight” “yes” id, exposure, …covariate, times…, history, weight.exposure, censor
3 “stratify” “no” id, exposure, …covariate, times…, history, strata
3 “stratify” “yes” id, exposure, …covariate, times…, history, strata, censor

Required arguments for balance() by diagnostic, approach, and censoring arguments |

Diagnostic Approach Censoring Additional Args in addition to Scope
1 “none” “no” exposure, times…, history
1 “none” “yes” exposure, times…, history, censor
2 “none” “no” exposure, times..,
2 “none” “yes” exposure, times…, history, censor
2 “weight” “no” exposure, times…, history, weight.exposure
2 “weight” “yes” exposure, times…, history, weight.exposure
2 “stratify” “no” exposure, times…, strata
2 “stratify” “yes” exposure, times…, strata, censor
3 “weight” “no” exposure, times…, history, weight.exposure
3 “weight” “yes” exposure, times…, history, weight.exposure, censor
3 “stratify” “no” exposure, times…, history, strata
3 “stratify” “yes” exposure, times…, history, strata, censor

Required arguments for balance() by diagnostic, approach, and censoring arguments |

Diagnostic Approach Censoring Additional Args in addition to Scope and Metric
1 “none” “no”
1 “none” “yes”
2 “none” “no”
2 “none” “yes”
2 “weight” “no”
2 “weight” “yes”
2 “stratify” “no” stratum
2 “stratify” “yes” stratum
3 “weight” “no”
3 “weight” “yes”
3 “stratify” “no” stratum
3 “stratify” “yes” stratum

Requirements for other functions

Generally, apply.scope() will follow balance(). diagnose() will follow the combination of lengthen() and balance().

USING lengthen() WITH MODELS INSTEAD OF balance()


stdcov <- function(x) {
  y <- (x-mean(x))/sd(x)
}

catie_sim.s <- catie_sim %>%
 mutate(Chg.pansstotal=stdcov(.data$Chg.pansstotal),
  cs14=stdcov(.data$cs14),
  cs16=stdcov(.data$cs16),
  pct.gain=stdcov(.data$pct.gain),
  epsmean=stdcov(.data$epsmean),
  qoltot=stdcov(.data$qoltot)
)

catie_sim.sw <- widen(
 input=catie_sim.s,
 id="CATIEID",
 time="time",
 exposure="studydisc",
 covariate=c("treat.grp","phase.change.cum",
             "Chg.pansstotal",
             "cs14","cs16",
             "pct.gain","epsmean","qoltot",
             "exacer","Bpansstotal",
             "unemployed"),
 weight.exposure="wtr"             
)
#> Adding missing grouping variables: `ID`

catie_sim.swh <- makehistory.two(
 input=catie_sim.sw,
 id="CATIEID",
 exposure.a="treat.grp",
 exposure.b="studydisc",
 times=c(0,1,3,6,9,12,15,18),
 name.history.a="ht",
 name.history.b="hs"
)

catie_sim.swhl3 <- lengthen(
  input=catie_sim.swh,
  id="CATIEID",
  diagnostic=3,
  censoring="no",
  exposure="studydisc",
  temporal.covariate=c("phase.change.cum","Chg.pansstotal",
                       "cs14","cs16","pct.gain",
                       "epsmean","qoltot"),
  times.exposure=c(0,1,3,6,9,12,15,18), 
  times.covariate=c(0,1,3,6,9,12,15,18),
  history="hs",
  weight.exposure="wtr"
)
#> Warning in lengthen(input = catie_sim.swh, id = "CATIEID", diagnostic =
#> 3, : The exposure and/or some covariates contain missing data. Subsequent
#> calculations are not guaranteed to be unbiased in the presence of partially
#> missing data.

catie_sim.swhlo3 <- omit.history(
  input=catie_sim.swhl3,
  covariate.name=c("phase.change.cum","Chg.pansstotal",
                   "cs14","cs16","pct.gain",
                   "epsmean","qoltot"),
  omission="relative",
  distance="1"
)

Here is the model-based SMD (via the broom package):


library(broom)

modtbl <- catie_sim.swhlo3 %>%
  mutate(time=time.exposure,
         history=hs) %>%
  group_by(name.cov,history,time) %>%
  do(tidy(lm(formula=value.cov~studydisc,weights=wtr,.))) %>%
  filter(term=="studydisc") %>% 
  select(name.cov,time,history,term,estimate) %>%
  mutate(E=ifelse(term=="studydisc",1,0),
         time.exposure=time,
         time.covariate=time,
         H=history,
         SMD=estimate)%>%
         select(-term,time,history) %>%
         ungroup()
#> Warning in summary.lm(x): essentially perfect fit: summary may be
#> unreliable

#> Warning in summary.lm(x): essentially perfect fit: summary may be
#> unreliable

#> Warning in summary.lm(x): essentially perfect fit: summary may be
#> unreliable

#> Warning in summary.lm(x): essentially perfect fit: summary may be
#> unreliable

#> Warning in summary.lm(x): essentially perfect fit: summary may be
#> unreliable

#> Warning in summary.lm(x): essentially perfect fit: summary may be
#> unreliable

#> Warning in summary.lm(x): essentially perfect fit: summary may be
#> unreliable

#> Warning in summary.lm(x): essentially perfect fit: summary may be
#> unreliable

#> Warning in summary.lm(x): essentially perfect fit: summary may be
#> unreliable
print(modtbl)
#> # A tibble: 273 x 9
#>    name.cov  time history  estimate     E time.exposure time.covariate
#>    <chr>    <dbl> <chr>       <dbl> <dbl>         <dbl>          <dbl>
#>  1 Chg.pan~     0 H1       1.14e-17     1             0              0
#>  2 Chg.pan~     1 H101    -2.31e- 4     1             1              1
#>  3 Chg.pan~     3 H10101  -4.07e- 2     1             3              3
#>  4 Chg.pan~     6 H10101~  1.74e- 1     1             6              6
#>  5 Chg.pan~     9 H10101~ -9.04e- 4     1             9              9
#>  6 Chg.pan~    12 H10101~  9.44e- 4     1            12             12
#>  7 Chg.pan~    15 H10101~  9.67e- 2     1            15             15
#>  8 Chg.pan~    18 H10101~ -1.31e- 2     1            18             18
#>  9 Chg.pan~     0 H2       1.26e-17     1             0              0
#> 10 Chg.pan~     1 H202     2.79e- 2     1             1              1
#> # ... with 263 more rows, and 2 more variables: H <chr>, SMD <dbl>

Because we have formatted the table to like the output from balance(), we can read this directly into makeplot():

modplot <- makeplot(
  input=modtbl,
  diagnostic=3,
  approach="weight",
  censoring="no",
  scope="recent",
  label.exposure="Treatment Group",
  label.covariate="Covariate"
)
modplot

As implemented, this approach mimicks what the balance function would do–estimate balance within each level of time and history. However, if we moved history and/or time from the group_by() statement to the right-hand side of the model, we would begin to incorporate assumptions about model form and its constancy across covariates. Our estimates would rely on these assumptions being correct. The chief advantage of this approach is that, with some merging of baseline covariates to the lengthened dataset, we can leverage our assumptions to condition balance estimates on several covariates, including continuous ones, when this is desired.