In some cases the response variable may be an average of responses from within the same subject or group. In this case, the variances for two observations having the same covariates will not be identical if the size of the groups vary. Instead, it will be inversely proportional to the size of the group. Mathematically, if the observed response is the group average \(\bar{Y}_i=\sum_{j=1}^{n_i} Y_{ij}/n_i\) and \(Y_{ij} \sim N(\mu_i,\phi_i)\) where the dispersion parameter, \(\phi_i\), may vary by group and depend on covariates then \[\bar{Y}_i \sim N(\mu_i,\phi_i/n_i).\] This situation can be accommodated in dalmatian
by specifying a column of weights within the data frame that provides the group size. Here is an example with simulated data that demonstrates this feature and shows the importance of properly accounting for weights.
Data for this example are contained in the data frame weights.df
saved within the file weights-data-1.RData
.
## Load library
library(dalmatian)
## Load simulated data
data(weights_data_1)
head(weights_data_1)
## n x y
## 1 6 -1.00 -0.89
## 2 6 -0.98 -0.71
## 3 7 -0.96 -0.82
## 4 10 -0.94 -1.05
## 5 8 -0.92 -0.61
## 6 20 -0.90 -0.52
The three columns in the data frame record the number of responses per group (n
), the value of the covariate (x
), and the mean response (y
). The data were generated from the model \[Y_{ij} \sim N(x_i,\exp(1+1x_i))\] with \(i=1,\ldots,100\) indexing the groups and \(j\) the observations within groups. In the data, the number of observations per group ranges from 5 to 40.
First we run the model with no weights.
## Mean model
=list(fixed=list(name="alpha",formula=~ x,
mymeanpriors=list(c("dnorm",0,.001))))
## Variance model
=list(fixed=list(name="psi",
mydispformula=~ x,
priors=list(c("dnorm",0,.001))),
link="log")
## Set working directory
## By default uses a system temp directory. You probably want to change this.
<- tempdir()
workingDir
## Define list of arguments for jags.model()
= list(file=file.path(workingDir,"weights_1_jags.R"),n.adapt=1000)
jm.args
## Define list of arguments for coda.samples()
= list(n.iter = 5000, n.thin = 20)
cs.args
## Run the model using dalmatian
<- dalmatian(
results1 df = weights_data_1,
mean.model = mymean,
dispersion.model = mydisp,
jags.model.args = jm.args,
coda.samples.args = cs.args,
response = "y",
overwrite = TRUE,
debug = FALSE)
## Step 1: Generating JAGS data...
## Done
## Step 2: Generating JAGS code...
## Done
## Step 3: Generating initial values...
##
## Running three parallel chains by default...
## Initializing chain 1 ...
## Initializing chain 2 ...
## Initializing chain 3 ...
## Done
## Step 4: Running model
## module glm loaded
## Initializing model
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 300
## Unobserved stochastic nodes: 4
## Total graph size: 1512
##
## Initializing model
##
## Generating samples
## Done
## Step 5: Tidying Output...
## Done
## Numerical summary statistics
<- summary(results1)
summary1 summary1
##
## Iterations = 1001:6000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## Posterior Summary Statistics for Each Model Component
##
## Mean Model: Fixed Effects
## Mean Median SD Lower 95% Lower 50% Upper 50% Upper 95%
## (Intercept) -0.02 -0.02 0.05 -0.12 -0.06 0.02 0.08
## x 0.87 0.87 0.08 0.71 0.82 0.93 1.03
##
## Dispersion Model: Fixed Effects
## Mean Median SD Lower 95% Lower 50% Upper 50% Upper 95%
## (Intercept) -1.5 -1.5 0.14 -1.78 -1.6 -1.4 -1.2
## x 1.2 1.2 0.27 0.67 1.0 1.4 1.7
## Graphical summaries
<- caterpillar(results1, show = TRUE) caterpillar1
From the summaries we can see that that the intercept in the dispersion model is being underestimated. The true value is 1, but the posterior mean is -1.5 with 95% HPD interval (-1.78,-1.22).
We now re-run the model including the weights.
## Specify column containing weights
$weights = "n"
mydisp
## Run model
= list(file=file.path(workingDir,"weights_2_jags.R"),n.adapt=1000)
jm.args
= dalmatian(df=weights_data_1,
results2 mean.model=mymean,
dispersion.model=mydisp,
jags.model.args=jm.args,
coda.samples.args=cs.args,
response="y",
overwrite = TRUE,
debug=FALSE)
## Step 1: Generating JAGS data...
## Done
## Step 2: Generating JAGS code...
## Done
## Step 3: Generating initial values...
##
## Running three parallel chains by default...
## Initializing chain 1 ...
## Initializing chain 2 ...
## Initializing chain 3 ...
## Done
## Step 4: Running model
## Initializing model
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 300
## Unobserved stochastic nodes: 4
## Total graph size: 1612
##
## Initializing model
##
## Generating samples
## Done
## Step 5: Tidying Output...
## Done
## Numerical summary statistics
<- summary(results2)
summary2 summary2
##
## Iterations = 1001:6000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## Posterior Summary Statistics for Each Model Component
##
## Mean Model: Fixed Effects
## Mean Median SD Lower 95% Lower 50% Upper 50% Upper 95%
## (Intercept) -0.04 -0.04 0.05 -0.13 -0.07 0.00 0.05
## x 0.88 0.89 0.08 0.73 0.83 0.93 1.04
##
## Dispersion Model: Fixed Effects
## Mean Median SD Lower 95% Lower 50% Upper 50% Upper 95%
## (Intercept) 1.1 1.1 0.14 0.80 0.97 1.2 1.4
## x 1.2 1.2 0.27 0.63 1.01 1.4 1.7
## Graphical summaries
<- caterpillar(results2, show = TRUE) caterpillar2
The new output shows that the estimate of the intercept for the variance model, 1.08, is now very close to the truth and the 95% credible interval, (0.8,1.36) easily covers the true value of 1.