The SUMMER package offers a collection of tools for small-area estimation with survey data in space and time. Such data are usually collected with complex stratified designs, which must be acknowledged in the analysis. In this vignette, we use two examples to illustrate spatial and spatial-temporal smoothing of design-based estimates.
The first BRFSS example may be of more general interest and provides an introduction to the idea of small-area estimation with survey data. The second example, estimating U5MR, involves more modeling steps to deal with the composite mortality indicator and data sparsity. More details on applying the methods to real-world data are discussed in the package vignette A Case Study of Estimating Subnational U5MR using Smoothed Direct Methods.
The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone health survey conducted by the Centers for Disease Control and Prevention (CDC) that has tracked health conditions and risk behaviors in the United States and its territories since 1984. The BRFSS sampling scheme is complex, with high variability in the sampling weights. In this example, we estimate the prevalence of Type II diabetes in health reporting areas (HRAs) of King County using BRFSS data. We will compare the weighted direct estimates with simple spatially smoothed estimates that ignore the weights, and with estimates that are both smoothed and weighted.
First, we load the package and the necessary data. INLA is not in a standard repository, so we check if it is available and install it if it is not.
library(SUMMER)
library(survey)
library(ggplot2)
library(patchwork)
if (!isTRUE(requireNamespace("INLA", quietly = TRUE))) {
  install.packages("INLA", repos = c(getOption("repos"),
    INLA = "https://inla.r-inla-download.org/R/stable"), dep = TRUE)
}
data(BRFSS)
data(KingCounty)
BRFSS contains the full BRFSS dataset with \(16,283\) observations. The diab2 variable is the binary indicator of Type II diabetes, strata is the strata indicator, and rwt_llcp is the final design weight. For the purpose of this analysis, we first remove records with a missing HRA code or diabetes status from this dataset, as sketched below.
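A minimal sketch of this filtering step (the exact filtering in the original analysis may differ):
# keep only records with a known diabetes status and HRA code
BRFSS <- subset(BRFSS, !is.na(diab2) & !is.na(hracode))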
KingCounty contains the map of the King County HRAs. In order to fit the spatial smoothing model, we first need to compute the adjacency matrix for the HRAs, mat, and make sure both its column and row names correspond to the HRA names.
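One way to construct it is with the getAmat function, using the HRA name column HRA2010v2_ (the same column used in the mapPlot calls later in this vignette); a sketch:
mat <- getAmat(KingCounty, KingCounty$HRA2010v2_)
# one row and one column per HRA
dim(mat)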
## [1] 48 48
Let \(y_i\) and \(m_i\) be the number of individuals flagged as having Type II diabetes and the denominators in areas \(i = 1, \dots, n\). Ignoring the survey design, the naive estimates of the prevalence of Type II diabetes are \(\hat p_i = y_i / m_i\), with associated standard errors \(\sqrt{\hat p_i(1 - \hat p_i)/m_i}\). The design-based weighted estimates of \(p_i\) and their associated variances can be calculated using the survey package.
design <- svydesign(ids = ~1, weights = ~rwt_llcp, strata = ~strata, data = BRFSS)
direct <- svyby(~diab2, ~hracode, design, svymean)
head(direct)
## hracode diab2 se
## Auburn-North Auburn-North 0.104 0.021
## Auburn-South Auburn-South 0.233 0.049
## Ballard Ballard 0.070 0.022
## Beacon/Gtown/S.Park Beacon/Gtown/S.Park 0.081 0.026
## Bear Creek/Carnation/Duvall Bear Creek/Carnation/Duvall 0.052 0.012
## Bellevue-Central Bellevue-Central 0.059 0.015
When the number of samples in each area is large, the design-based variance is usually small and the direct estimates work well. However, when the sample sizes in some areas are small, we would like to perform some form of smoothing over the areas. Ignoring the survey weights for now, we can consider the following Bayesian smoothing model:
\[\begin{eqnarray*}
y_i | p_i &\sim& \mbox{Binomial}(m_i,p_i)\\
\theta_i &=& \log \left( \frac{p_i}{1-p_i}\right) = \mu+ \epsilon_i+s_i,\\
\epsilon_i &\sim & N(0,\sigma_\epsilon^2)\\
s_i | s_j, j \in \text{ne}(i) &\sim& N\left(\bar{s}_i, \dfrac{\sigma_s^2}{n_i} \right),
\end{eqnarray*}\] where \(n_i\) is the number of neighbors of area \(i\), \(\bar{s}_i = \frac{1}{n_i}\sum_{j \in \text{ne}(i)} s_j\), and hyperpriors are placed on \(\mu\), \(\sigma_\epsilon^2\), and \(\sigma_s^2\). This simple smoothing model can be fitted using the smoothSurvey function by specifying NULL for the survey design parameters.
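A sketch of such a call, storing the result as smoothed for the comparisons below (the argument set mirrors the design-based call shown later in this vignette):
smoothed <- smoothSurvey(data = BRFSS, geo = KingCounty, Amat = mat,
    responseType = "binary", responseVar = "diab2", strataVar = NULL,
    weightVar = NULL, regionVar = "hracode", clusterVar = NULL, CI = 0.95)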
The smoothed estimates of \(p_i\) and \(\theta_i\) can be found in the smooth object returned by the function, and the direct estimates are stored in the HT object (without specifying survey weights, these are the simple binomial probabilities).
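The components of the returned object can be listed with summary(); the output below is consistent with calling it on the returned list (a sketch):
summary(smoothed)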
## Length Class Mode
## HT 8 data.frame list
## smooth 11 data.frame list
## smooth.overall 0 -none- NULL
## fit 51 inla list
## CI 1 -none- numeric
## Amat 2304 -none- numeric
## responseType 1 -none- character
## formula 3 formula call
To account for the survey design in the smoothing, we instead model the design-based direct estimates on the logit scale, \(\mbox{logit}(\hat p_i) \sim N(\theta_i, \hat V_i)\), where both \(\hat p_i\) and the asymptotic variance on the logit scale, \(\hat V_i\), are assumed known. This model can be fit with smoothSurvey by supplying the survey design variables.
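A sketch of such a call, storing the result as svysmoothed (the arguments mirror the covariate example later in this vignette, without X):
svysmoothed <- smoothSurvey(data = BRFSS, geo = KingCounty, Amat = mat,
    responseType = "binary", responseVar = "diab2", strataVar = "strata",
    weightVar = "rwt_llcp", regionVar = "hracode", clusterVar = "~1", CI = 0.95)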
Again, the smoothed estimates accounting for the design and the design-based direct estimates can be obtained from svysmoothed$smooth and svysmoothed$HT, respectively. We can now compare the four sets of estimates and their associated variances; the comparison shows that smoothing substantially reduces the variance of the estimates.
est <- data.frame(naive = smoothed$HT$HT.est,
weighted = svysmoothed$HT$HT.est,
smooth = smoothed$smooth$mean,
weightedsmooth = svysmoothed$smooth$mean)
var <- data.frame(naive = smoothed$HT$HT.var,
weighted = svysmoothed$HT$HT.var,
smooth = smoothed$smooth$var,
weightedsmooth = svysmoothed$smooth$var)
l1 <- range(est)
l2 <- range(var)
g1 <- ggplot(est, aes(x = naive, y = smooth)) + geom_point() +
geom_abline(slope = 1, intercept = 0, color = "red") +
ggtitle("Naive Estimates") + xlab("Direct") +
ylab("Smoothed")+ xlim(l1) + ylim(l1)
g2 <- ggplot(var, aes(x = naive, y = smooth)) + geom_point() +
geom_abline(slope = 1, intercept = 0, color = "red") +
ggtitle("Naive Variance") + xlab("Direct") +
ylab("Smoothed") + xlim(l2) + ylim(l2)
g3 <- ggplot(est, aes(x = weighted, y = weightedsmooth)) + geom_point() +
geom_abline(slope = 1, intercept = 0, color = "red") +
ggtitle("Survey Weighted Estimates") + xlab("Direct") +
ylab("Smoothed") + xlim(l1) + ylim(l1)
g4 <- ggplot(var, aes(x = weighted, y = weightedsmooth)) + geom_point() +
geom_abline(slope = 1, intercept = 0, color = "red") +
ggtitle("Survey Weighed Variance") + xlab("Direct") +
ylab("Smoothed") + xlim(l2) + ylim(l2)
(g1 + g2) / (g3 + g4)
Covariates at the area level can be included in the smoothing model by specifying the X argument. The first column of the covariate matrix should contain the names of the regions, matching the column and row names of the adjacency matrix.
Education <- svyby(~educau, ~hracode, design, svymean)
Education <- Education[, 1:5]
svysmoothed_with_covariate <- smoothSurvey(data = BRFSS, geo = KingCounty,
    Amat = mat, X = Education, responseType = "binary", responseVar = "diab2",
    strataVar = "strata", weightVar = "rwt_llcp", regionVar = "hracode",
    clusterVar = "~1", CI = 0.95)
svysmoothed_with_covariate$fit$summary.fixed
## mean sd 0.025quant 0.5quant 0.975quant mode
## (Intercept) -2.67 0.071 -2.81 -2.67 -2.54 -2.67
## educauless.than.HS 1.93 3.206 -4.41 1.94 7.95 1.95
## educauHS.grad -1.36 1.350 -4.02 -1.37 1.19 -1.37
## educausome.college 1.59 1.144 -0.66 1.59 3.75 1.58
## educaucollege.grad -0.74 0.507 -1.73 -0.74 0.22 -0.75
## kld
## (Intercept) 2.9e-07
## educauless.than.HS 9.1e-07
## educauHS.grad 1.3e-06
## educausome.college 5.9e-07
## educaucollege.grad 5.5e-06
The smoothSurvey function has default hyperprior choices built in using Penalising Complexity (PC) priors. To change the hyperparameters of the PC priors, we can simply use pc.u and pc.alpha for the priors on the precision parameters, and pc.u.phi and pc.alpha.phi for the mixing parameter \(\phi\) in the BYM2 model for the spatial effects. The PC prior on a precision \(\tau\) is parameterized by \[
\mbox{Prob}(\sigma > u) = \alpha,
\] where \(\sigma = 1 / \sqrt{\tau}\) is the standard deviation, and the PC prior on the mixing parameter \(\phi\) is parameterized by \[
\mbox{Prob}(\phi < u) = \alpha.
\]
For example, if we change pc.u and pc.alpha from the default values \(u = 1, \alpha = 0.01\) to \(u = 0.1, \alpha = 0.01\), we assign more prior mass to smaller variances of the random effects.
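A sketch of this refit with the tighter PC prior, storing the result as svysmooth.1 (the object used in the plots below):
svysmooth.1 <- smoothSurvey(data = BRFSS, geo = KingCounty, Amat = mat,
    responseType = "binary", responseVar = "diab2", strataVar = "strata",
    weightVar = "rwt_llcp", regionVar = "hracode", clusterVar = "~1",
    CI = 0.95, pc.u = 0.1, pc.alpha = 0.01)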
We can visualize one or more metrics on the map using the mapPlot function, for example:
toplot <- svysmoothed$smooth
toplot$mean.new <- svysmooth.1$smooth$mean
toplot$var.new <- svysmooth.1$smooth$var
variables <- c("mean", "mean.new",
"var", "var.new")
names <- c("Mean (default prior)", "Mean (new prior)",
"Variance (default prior)", "Variance (new prior)")
g1 <- mapPlot(data = toplot, geo = KingCounty, variables=variables[1:2], labels = names[1:2], by.data = "region", by.geo = "HRA2010v2_", size = 0.1)
g2 <- mapPlot(data = toplot, geo = KingCounty, variables=variables[3:4], labels = names[3:4], by.data = "region", by.geo = "HRA2010v2_", size = 0.1)
g1 / g2
A more flexible way to change the random effect model entirely is the formula argument. To use this option, we need to refer to the internal region indicators, region.struct and region.unstruct, as the index variables. These index variable names can also be found in the fitted object via summary(svysmoothed$fit) or svysmoothed$formula. For example, we can specify a different parameterization of the BYM model. For users familiar with the INLA syntax, this option opens up more possibilities for extending the modeling framework.
formula <- "f(region.struct, model = 'besag', graph = Amat,
constr = TRUE, scale.model = TRUE) +
f(region.unstruct, model = 'iid')"
svysmooth.2 <- smoothSurvey(data = BRFSS, geo = KingCounty, Amat = mat,
responseType = "binary", responseVar="diab2",
strataVar="strata", weightVar="rwt_llcp", regionVar="hracode",
clusterVar = "~1", CI = 0.95,
formula = formula)
DemoData contains model survey data provided by DHS. Note that this dataset is fake and does not represent any real country's data. Data similar to the DemoData data used in this vignette can be obtained using getBirths. DemoMap contains geographic data for the 1995 Uganda Admin 1 regions defined by DHS. Data similar to the DemoMap data used in this vignette can be obtained using read_shape.
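In the model fitting below, the map and the adjacency matrix are taken from DemoMap; a minimal sketch, assuming its geo and Amat components:
data(DemoData)
data(DemoMap)
geo <- DemoMap$geo
# note: this overwrites the King County adjacency matrix from the first example
mat <- DemoMap$Amat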
DemoData is a list of \(5\) data frames, where each row represents one person-month record and contains the \(8\) variables shown below. Notice that the time variable has been binned into 5-year periods from 80-84 to 10-14.
## Length Class Mode
## 1999 8 data.frame list
## 2003 8 data.frame list
## 2007 8 data.frame list
## 2011 8 data.frame list
## 2015 8 data.frame list
## clustid id region time age weights strata died
## 1 1 1 eastern 00-04 0 1.1 eastern.rural 0
## 2 1 1 eastern 00-04 1-11 1.1 eastern.rural 0
## 3 1 1 eastern 00-04 1-11 1.1 eastern.rural 0
## 4 1 1 eastern 00-04 1-11 1.1 eastern.rural 0
## 5 1 1 eastern 00-04 1-11 1.1 eastern.rural 0
## 6 1 1 eastern 00-04 1-11 1.1 eastern.rural 0
DemoData was obtained by processing raw DHS birth data (in .dta format) in R. The raw file of birth recodes can be downloaded from the DHS website https://dhsprogram.com/data/Download-Model-Datasets.cfm. For this example dataset, no registration is needed; for real DHS survey datasets, access must be requested from DHS directly. DemoData contains a small sample of the observations in this dataset, randomly assigned to \(5\) example DHS surveys.
Here we demonstrate how to split the raw data into the person-month format. Notice that reading files from earlier versions of Stata requires the readstata13 package. The following script is based on the example dataset ZZBR62FL.DTA available from the DHS website. We use the interaction of v024 and v025 as the strata indicator for the purpose of demonstration. We can see from range(data$v008) that the CMC code for the date of interview corresponds to the year 1900 + floor(1386/12) = 2015. In practice, however, the survey year is usually known. The surveyyear argument allows mis-recorded dates to be screened out: records with dates after surveyyear will be removed. Thus, for surveys taking place over multiple years, we suggest using the later year as surveyyear. If it is set to NA, no checking is performed.
library(readstata13)
my_fp <- "data/ZZBR62DT/ZZBR62FL.DTA"
dat <- getBirths(filepath = my_fp, surveyyear = 2015, strata = c("v024", "v025"))
dat <- dat[,c("v001","v002","v024","per5","ageGrpD","v005","strata","died")]
colnames(dat) <- c("clustid","id","region","time","age","weights","strata","died")
Next, we obtain the Horvitz-Thompson estimators using getDirectList. We specify the survey design as two-stage stratified cluster sampling, where strata are specified in the strata column, and clusters are specified by both the cluster ID (clustid) and the household ID (id).
years <- levels(DemoData[[1]]$time)
data_multi <- getDirectList(births = DemoData, years = years, regionVar = "region",
    timeVar = "time", clusterVar = "~clustid+id", ageVar = "age",
    weightsVar = "weights", geo.recode = NULL)
Before fitting the model, we also aggregate the estimates from the different surveys into a single set of estimates, using the inverse design-based variances as weights, as sketched below.
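A sketch of this step using the aggregateSurvey function; the dimensions printed below correspond to the multi-survey data frame and the aggregated one:
dim(data_multi)
# combine survey-specific estimates with inverse-variance weights
data <- aggregateSurvey(data_multi)
dim(data)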
## [1] 150 11
## [1] 30 10
With the combined direct estimates, we are ready to fit the smoothing models. First, we ignore the subnational level and fit a model with temporal random effects only, using the subset of the data with the region variable equal to "All". Before fitting, we define the list of time periods onto which we wish to project the estimates. We start with a Random Walk 2 model defined on the 5-year periods. The argument m = 1 specifies that the random walk is at the same temporal resolution as the input data; see the next example for a random walk specified at a finer resolution.
years.all <- c(years, "15-19")
fit1 <- smoothDirect(data = data, Amat = NULL, year_label = years.all,
year_range = c(1985, 2019), time.model = "rw2", m = 1)
## ----------------------------------
## Smoothed Direct Model
## Main temporal model: rw2
## Temporal resolution: period model (m = 1)
## ----------------------------------
## Fixed Effects
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## (Intercept) -1.4 0.078 -1.6 -1.4 -1.3 -1.4 0
## ----------------------------------
## Random Effects
## Name Model
## 1 time.struct RW2 model
## 2 time.unstruct IID model
## ----------------------------------
## Model hyperparameters
## mean sd 0.025quant 0.5quant 0.975quant
## Precision for time.struct 1957 31708 7.3 164 10130
## Precision for time.unstruct 1862 19135 11.6 228 10370
## mode
## Precision for time.struct 14
## Precision for time.unstruct 24
## [,1]
## Expected number of parameters 3.71
## Stdev of the number of parameters 0.92
## Number of equivalent replicates 1.62
## [,1]
## log marginal-likelihood (integration) -1.3
## log marginal-likelihood (Gaussian) -1.9
When data are sparse, direct estimates at the yearly level may be unstable; this is why we used 5-year periods as the model's temporal resolution in this example. When performing temporal smoothing, however, we can define the temporal smoother on the yearly scale instead. Notice that the direct estimates are still calculated on 5-year intervals, but the smoothed estimator now produces estimates at both the yearly and the period resolutions.
fit2 <- smoothDirect(data = data, Amat = NULL, year_label = years.all,
year_range = c(1985, 2019), time.model = "rw2", m = 5, type.st = 4)
## ----------------------------------
## Smoothed Direct Model
## Main temporal model: rw2
## Temporal resolution: yearly model (m = 5)
## ----------------------------------
## Fixed Effects
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## (Intercept) -1.4 0.071 -1.6 -1.4 -1.3 -1.4 0
## ----------------------------------
## Random Effects
## Name Model
## 1 time.struct RGeneric2
## 2 time.unstruct RGeneric2
## ----------------------------------
## Model hyperparameters
## mean sd 0.025quant 0.5quant 0.975quant mode
## Theta1 for time.struct 5.3 1.9 1.9 5.1 9.2 4.6
## Theta1 for time.unstruct 4.6 1.8 1.4 4.4 8.2 3.9
## [,1]
## Expected number of parameters 3.48
## Stdev of the number of parameters 0.82
## Number of equivalent replicates 1.73
## [,1]
## log marginal-likelihood (integration) -74
## log marginal-likelihood (Gaussian) -75
The marginal posteriors are already stored in the fitted object. We use the following function to extract and re-arrange them.
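A minimal sketch using getSmoothed, assuming it accepts smoothDirect fits as in the package documentation:
out1 <- getSmoothed(fit1)
out2 <- getSmoothed(fit2)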
We can compare the results visually using the function below.
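For example, a side-by-side comparison of the two national fits, mirroring the subnational plots later in this vignette (a sketch):
g1 <- plot(out1, is.yearly = FALSE) + ggtitle("National period model")
g2 <- plot(out2, is.yearly = TRUE) + ggtitle("National yearly model")
g1 + g2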
Now we fit the full model on all subnational regions. First, we use the Random Walk 2 model defined on the 5-year period.
fit3 <- smoothDirect(data = data, Amat = mat, year_label = years.all,
    year_range = c(1985, 2019), time.model = "rw2", m = 1, type.st = 4)
## ----------------------------------
## Smoothed Direct Model
## Main temporal model: rw2
## Temporal resolution: period model (m = 1)
## Spatial effect: bym2
## Interaction temporal model: rw2
## Interaction type: 4
## ----------------------------------
Similarly we can also estimate the Random Walk 2 random effects on the yearly scale.
fit4 <- smoothDirect(data = data, Amat = mat, year_label = years.all,
    year_range = c(1985, 2019), time.model = "rw2", m = 5, type.st = 4)
## ----------------------------------
## Smoothed Direct Model
## Main temporal model: rw2
## Temporal resolution: yearly model (m = 5)
## Spatial effect: bym2
## Interaction temporal model: rw2
## Interaction type: 4
## ----------------------------------
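As before, we extract the posterior summaries for the two subnational fits (a sketch using getSmoothed):
out3 <- getSmoothed(fit3)
out4 <- getSmoothed(fit4)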
The figures below show a comparison of the subnational models at the two temporal resolutions.
g1 <- plot(out3, is.yearly = FALSE) + ggtitle("Subnational period model")
g2 <- plot(out4, is.yearly = TRUE) + ggtitle("Subnational yearly model")
g1 + g2
We can also add back the direct estimates for comparison when plotting the smoothed estimates.
plot(out4, is.yearly = TRUE, data.add = data_multi, option.add = list(point = "mean",
by = "surveyYears")) + facet_wrap(~region, scales = "free")
Finally, we show the estimates over time on maps.
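A sketch using mapPlot on the period estimates; the is.yearly column of out4 and the REGNAME column of DemoMap$geo are assumptions based on the package's example data:
mapPlot(data = subset(out4, is.yearly == FALSE), geo = DemoMap$geo,
    variables = "years", values = "median", by.data = "region",
    by.geo = "REGNAME", is.long = TRUE)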