This vignette is organized as follows:
Getting started
Exploring the data
Model specification
Parameter estimation
Robust prediction of the area-level means
Mean square prediction error
In small area estimation (SAE), we distinguish two types of models:
The classification of the models (A or B) is from Rao (2003, Chapter 7). The current version of the package implements the following estimation methods under the unit-level model (model B):
The package can be installed from CRAN using install.packages("rsae")
. Once the rsae
package has been installed, we need to load it to the current session by library("rsae")
.
Work flow
The prediction of the area-level means takes three steps.
saemodel()
,fitsaemodel()
,robpredict()
; this step includes the computation of the mean square prediction error.We use the landsat
data of Battese et al. (1988), which is loaded by
> data("landsat")
The landsat
data is a compilation of survey and satellite data from Battese et al. (1988). It consists of data on segments (primary sampling unit; 1 segement approx. 250 hectares) under corn and soybeans for 12 counties in north-central Iowa.
In the three smallest counties (Cerro Gordo, Hamilton, and Worth), data is available only for one sample segment. All other counties have data for more than one sample segment (i.e., unbalanced data). The largest area (Hardin) covers six units.
The data for the observations 32, 33, and 34 are shown below
> landsat[32:34,]
SegmentsInCounty SegementID HACorn HASoybeans PixelsCorn PixelsSoybeans32 556 1 88.59 102.59 220 262
33 556 2 88.59 29.46 340 87
34 556 3 165.35 69.28 355 160
MeanPixelsCorn MeanPixelsSoybeans outlier CountyName32 325.99 177.05 FALSE Hardin
33 325.99 177.05 TRUE Hardin
34 325.99 177.05 FALSE Hardin
We added the variable outlier
to the original data. It flags observation 33 as an outlier, which is in line with the discussion in Battese et al. (1988).
We consider estimating the parameters of the basic unit-level model (Battese et al. , 1988)
\[ \begin{equation*} \mathrm{HACorn}_{i,j} = \alpha + \beta_1 \mathrm{PixelsCorn}_{i,j} + \beta_2 \mathrm{PixelsSoybeans}_{i,j} + u_i + e_{i,j}, \end{equation*} \]
where \(j=1,\ldots, n_i\), \(i=1, \ldots,12\), and
The model is defined with the help of thesaemodel()
function:
> bhfmodel <- saemodel(formula = HACorn ~ PixelsCorn + PixelsSoybeans,
+ area = ~ CountyName,
+ data = subset(landsat, subset = (outlier == FALSE)))
where
formula
defines the fixed-effect part of the model (the ~
operator separates dependent and independent variables; by default, the model includes a regression intercept),area
specifies the area-level random effect (variable CountyName
serves as area identifier; note that the argument area
is also a formula
object),data
specifies the data.frame
(here, we consider the subset of observations that are not flagged as outliers).If you need to know more about a particular model, you can use the summary()
method.
Having specified bhfmodel
, we consider estimating its parameters by different estimation method.
The maximum likelihood (ML) estimates of the parameters are computed by
> mlfit <- fitsaemodel(method = "ml", bhfmodel)
On print, object mlfit
shows the following.
> mlfit
-MODEL (model type B)
ESTIMATES OF SAE: Maximum likelihood estimation
Method---
Fixed effects: HACorn ~ (Intercept) + PixelsCorn + PixelsSoybeans
Model
:
Coefficients
(Intercept) PixelsCorn PixelsSoybeans 50.9676 0.3286 -0.1337
---
Random effects
: ~1| CountyName
Model
(Intercept) Residual11.00 11.72
Std. Dev. ---
: 36
Number of Observations: 12 Number of Areas
Inferential statistics for the object mlfit
are computed by the summary()
method.
> summary(mlfit)
ESTIMATION SUMMARY: Maximum likelihood estimation
Method---
Fixed effects-value df p-value
Value Std.Error t50.96756 23.47509 2.17113 22 0.0410 *
(Intercept) 0.32858 0.04798 6.84772 22 7.05e-07 ***
PixelsCorn -0.13371 0.05306 -2.51984 22 0.0195 *
PixelsSoybeans ---
: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Signif. codes
Function coef()
extracts the estimated coefficients from a fitted model.
coef()
is called with argument type = "both"
, which implies that the fixed effects and the random effect variances are returned.type = "ranef"
or type = "fixef"
returns the random effect variances or the fixed effects, respectively.Function convergence()
can be useful if the algorithm did not converge; see Appendix.
The Huber-type M-estimator is appropriate for situations where the response variable is supposed to be (moderately) contaminated by outliers. The M-estimator downweights residual outliers, but it does not limit the effect of high-leverage observations.
The Huber-type M-estimator of bhfmodel
is
> huberfit <- fitsaemodel("huberm", bhfmodel, k = 1.5)
where k
is the robustness tuning constant of the Huber \(\psi\)-function (\(0<k < \infty\)). On print, we have
> huberfit
-MODEL (model type B)
ESTIMATES OF SAE: Huber-type M-estimation
Method: k = 1.5
Robustness tuning constant---
Fixed effects: HACorn ~ (Intercept) + PixelsCorn + PixelsSoybeans
Model
:
Coefficients
(Intercept) PixelsCorn PixelsSoybeans 50.2849 0.3285 -0.1392
---
Random effects
: ~1| CountyName
Model
(Intercept) Residual11.95 12.36
Std. Dev. ---
: 36
Number of Observations: 12 Number of Areas
If the algorithm did not converge, see paragraph safe mode (below) and Appendix. Inferential statistics for huberfit
are computed by the summary()
method.
> summary(huberfit)
ESTIMATION SUMMARY: Huber-type M-estimation
Method: k = 1.5
Robustness tuning constant---
Fixed effects-value df p-value
Value Std.Error t50.28489 24.84651 2.02382 22 0.0553 .
(Intercept) 0.32845 0.05077 6.46906 22 1.65e-06 ***
PixelsCorn -0.13916 0.05618 -2.47711 22 0.0214 *
PixelsSoybeans ---
: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Signif. codes---
/winsorization:
Degree of downweighting
sum(wgt)/n
0.9841
fixeff 0.9723
residual var 0.9841 area raneff var
The output is separated into 2 blocks. The first block shows inferential statistics of the estimated fixed effects. The second block reports the degree of downweighting that is applied to outlying residuals at the final iteration (by estimating equations, EE, separately). The more the value of “sum(wgt)/n” deviates from 1.0, the more downweighting has been applied.
The methods coef()
and convergence()
are also available.
In the safe mode, the algorithm is initialized by a high-breakdown-point regression estimator. Note. In order to use the safe mode, the robustbase
package of Maechler et al. (2021) must be installed.
The safe mode is entered by specifying one the following initialization methods:
init = "lts"
: least trimmed squares (LTS) regression estimator; see Rousseeuw (1984) and Rousseeuw and van Driessen (2006),init = "s"
: regression S-estimator; see Rousseeuw and Yohai (1984) and Salibian-Barerra and Yohai (2006)in the call of fitsaemodel()
. The safe mode uses a couple of internal tests to check whether the estimates at consecutive iterations behave well. Notably, it tries to detect cycles in the sequence iteratively refined estimates; see Schoch (2012).
The package implements the following methods to predict the area-level means:
EBLUP and robust predictions are computed by
> robpredict(fit, areameans = NULL, k = NULL, reps = NULL, progress_bar = TRUE)
where
fit
is a fitted model (ML estimate or M-estimate),k
is the robustness tuning constant of the Huber \(\psi\)-function for robust prediction. By default k
is NULL
which means that the procedure inherits the tuning constant k
that has been used in fitting the model; see fitsaemodel()
. For the ML estimator, \(k\) is taken as a large value that “represents” infinity; see argument k_Inf
in fitsaemodel.control()
.reps
and progress_bar
are used in the computation of the mean square prediction error (see below).In the landsat
data, the county-specific population means of pixels of the segments under corn and soybeans are recorded in the variables MeanPixelsCorn and MeanPixelsSoybeans, respectively. Each sample segment in a particular county is assigned the county-specific means. Therefore, the population mean of the variables MeanPixelsCorn and MeanPixelsSoybeans occurs \(n_i\) times. The unique county-specific population means are obtained using
> dat <- unique(landsat[-33, c("MeanPixelsCorn", "MeanPixelsSoybeans", "CountyName")])
> dat <- cbind(rep(1,12), dat)
> rownames(dat) <- dat$CountyName
> dat <- dat[,1:3]
> dat
rep(1, 12) MeanPixelsCorn MeanPixelsSoybeans
1 295.29 189.70
Cerro Gordo 1 300.40 196.65
Hamilton 1 289.60 205.28
Worth 1 290.74 220.22
Humboldt 1 318.21 188.06
Franklin 1 257.17 247.13
Pocahontas 1 291.77 185.37
Winnebago 1 301.26 221.36
Wright 1 262.17 247.09
Webster 1 314.28 198.66
Hancock 1 298.65 204.61
Kossuth 1 325.99 177.05 Hardin
The first column of dat
is a dummy variable (because bhfmodel
has a regression intercept). The second and third column represent, respectively, the county-specific population means of segments under corn and soybeans.
Consider the ML estimate, mlfit
, of the bhfmodel
model. The EBLUP of the area-level means is computed by
> pred <- robpredict(mlfit, areameans = dat)
> pred
/Predicted Area-Level Means:
Robustly Estimated
raneff fixeff area mean-0.348 122.629 122.281
Cerro Gordo 2.731 123.379 126.110
Hamilton -11.522 118.677 107.154
Worth -8.313 117.053 108.741
Humboldt 13.641 130.380 144.021
Franklin 9.529 102.425 111.954
Pocahontas -9.043 122.052 113.009
Winnebago 1.648 120.358 122.006
Wright 11.082 104.073 115.155
Webster -3.229 127.671 124.442
Hancock -14.621 121.740 107.119
Kossuth 8.445 134.408 142.853 Hardin
because (by default) k = NULL
. By explicitly specifying k
in the call of robpredict()
, we can—in principle—compute robust predictions for the ML estimate instead of the EBLUP.
Consider the Huber M-estimate huberfit
(with tuning constant k = 1.5
, see call of above). If k
is not specified in the call of robpredict()
, robust predictions are computed with \(k\) equal to 1.5; otherwise, the robust predictions are based on the value of k
in the call of robpredict()
.
Object pred
is a list
with slots "fixeff"
, "raneff"
, "means"
, "mspe"
, etc. For instance, the predicted means can be extracted by pred$means
or pred[["means"]]
.
A plot()
function is available to display the predicted means.
Function residuals()
computes the residuals.
Function robpredict()
can be used to compute bootstrap estimates of the mean squared prediction errors (MSPE) of the predicted area-level means; see Sinha and Rao (2009). To compute the MSPE, we must specify the number of bootstrap replicates (reps)
. If reps = NULL
, the MSPE is not computed.
Consider (for instance) the ML estimate. EBLUP and MSPE of the EBLUP based on 100 bootstrap replicates are computed by
> pred <- robpredict(mlfit, areameans = dat, reps = 100,
+ progress_bar = FALSE)
> pred
/Predicted Area-Level Means:
Robustly Estimated
raneff fixeff area mean MSPE -0.348 122.629 122.281 82.418
Cerro Gordo 2.731 123.379 126.110 99.031
Hamilton -11.522 118.677 107.154 60.627
Worth -8.313 117.053 108.741 64.231
Humboldt 13.641 130.380 144.021 39.708
Franklin 9.529 102.425 111.954 39.070
Pocahontas -9.043 122.052 113.009 43.537
Winnebago 1.648 120.358 122.006 42.533
Wright 11.082 104.073 115.155 32.976
Webster -3.229 127.671 124.442 32.567
Hancock -14.621 121.740 107.119 28.018
Kossuth 8.445 134.408 142.853 20.681
Hardin : 100 boostrap replicates) (MSPE
The number of reps = 100
is kept small for illustration purposes; in practice, we should choose larger values. The progress bar has been disabled as it is not suitable for the automatic vignette generation.
A visual display of the predicted area-level means obtains by plot(pred, type = "l", sort = "means")
.
COPT, S. and M.-P. VICTORIA-FESER (2009). Robust prediction in mixed linear models, Tech. rep., University of Geneva.
BATTESE, G. E., R. M. HARTER and W. A. FULLER (1988). An error component model for prediction of county crop areas using, Journal of the American Statistical Association 83, 28–36. DOI: 10.2307/2288915
FELLNER, W. H. (1986). Robust estimation of variance components, Technometrics 28, 51–60. DOI: 10.1080/00401706.1986.10488097
HERITIER, S., E. CANTONI, S. COPT, and M.-P. VICTORIA-FESER (2009). Robust Methods in Biostatistics, New York: John Wiley & Sons.
MAECHLER, M., P. J. ROUSSEEUW, C. CROUX, V. TODOROC, A. RUCKSTUHL, M. SALIBIAN-BARRERA, T. VERBEKE, M. KOLLER, M. KOLLER, E. L. T. CONCEICAO and M. A. DI PALMA (2021). robustbase: Basic Robust Statistics. R package version 0.93-9. URL: CRAN.R-project.org/package=robustbase.
RAO, J.N.K. (2003). Small Area Estimation, New York: John Wiley and Sons.
RICHARDSON, A. M. and A. H. WELSH (1995). Robust Restricted Maximum Likelihood in Mixed Linear Models, Biometrics 51, 1429–1439. DOI: 10.2307/2533273
ROUSSEEUW, P. J. (1984). Least Median of Squares Regression, Journal of the American Statistical Association 79, 871–880. DOI: 10.2307/2288718
ROUSSEEUW, P. J. and K. VAN DRIESSEN (2006). Computing LTS Regression for Large Data Sets, Data Mining and Knowledge Discovery 12, 29–45. DOI: 10.1007/s10618-005-0024-4
ROUSSEEUW, P. J. and V. YOHAI (1984). Robust Regression by Means of S Estimators, in Robust and Nonlinear Time Series Analysis, ed. by FRANKE, J., W. HÄRDLE AND R. D. MARTIN, New York: Springer, 256–274.
SALIBIAN-BARRERA, M. and V. J. YOHAI (2006). A Fast Algorithm for S-Regression Estimates, Journal of Computational and Graphical Statistics 15, 414–427. DOI: 10.1198/106186006x113629
SCHOCH, T. (2012). Robust Unit-Level Small Area Estimation: A Fast Algorithm for Large Datasets. Austrian Journal of Statistics 41, 243–265. URL: www.stat.tugraz.at/AJS/ausg124/124Schoch.pdf
SINHA, S.K. and J.N.K. RAO (2009). Robust small area estimation. Canadian Journal of Statistics 37, 381–399. DOI: 10.1002/cjs.10029
Suppose that we computed
> est <- fitsaemodel("ml", bhfmodel, niter = 3)
The algorithm did not converge and we get the following output
> est
!
THE ALGORITHM DID NOT CONVERGE---
1) use convergence() of your fitted model to learn more
2) study the documentation using the command ?fitsaemodel
3) you may call fitsaemodel with 'init' equal to (either) 'lts'
's' (this works also for ML, though it may no be very efficient)
or 4) if it still does not converge, the last resort is to modify
'acc' and/or 'niter'
-MODEL (model type B)
ESTIMATES OF SAE: Maximum likelihood estimation
Method---
Fixed effects: HACorn ~ (Intercept) + PixelsCorn + PixelsSoybeans
Model
:
Coefficients
(Intercept) PixelsCorn PixelsSoybeans NA NA NA
---
Random effects
: ~1| CountyName
Model
(Intercept) ResidualNA NA
Std. Dev. ---
: 36
Number of Observations: 12 Number of Areas
To learn more, we call convergence(est)
and get
> convergence(est)
CONVERGENCE REPORT: ALGORITHM DID NOT CONVERGE!
NOTE---
iterations (niter) and
User specified number of precision (acc):
numeric
niter acc3 1e-05
overall loop 200 1e-05
fixeff 200 1e-05
residual var 100 1e-05
area raneff var ---
-specific iterations in each call
Number of EE-defined specs), reported for
(given the user:
each of the3overall iterations
fixeff residual var area raneff var1 2 2 18
2 2 2 11
3 2 2 12
Clearly, we have deliberately set the number of (overall or outer loop) iterations equal to niter = 3
to illustrate the behavior; see also “niter = 3” on the line “overall loop” in the above table. As a consequence, the algorithm failed to converge.
The maximum number of iterations for the fixed effects (“fixeff”) is 200, for the residual variance estimator (“residual var”) is 200, for the area-level random effect variance is (“area raneff var”) is 100. The default values can be modified; see documentation of fitsaemodel.control()
.
The last table in the above output shows the number of iterations that the algorithm required for the niter = 3
overall loop iteration (i.e., in the first loop, we have fixef = 2
, residual var = 2
and area raneff var = 18
iterations). From this table we can learn how to adjust the default values in case the algorithm does not converge.