2. Fitting models and making predictions

If interested in the theory involving this calculations, take a look at this vignette.

library(smile)
library(ggplot2)
library(sf)
#> Linking to GEOS 3.10.1, GDAL 3.4.0, PROJ 8.2.0; sf_use_s2() is TRUE

Considering the same setup used in the vignette about converting sf to spm objects, we are going to predict the Life Expecation at Birth (LEF) score (observed at the Middle Super Output Areas (MSOA), in Liverpool, into the Lower Super Output Area (LSOA). In this second level, we have the Life Expectancy at Birth for each area. This data was taken from Johnson, Diggle, and Giorgi (2020).

The code below loads the data and converts the LSOA sf object into a spm object. In particular, we are going to use numerical integration (based on a 500 points grid) with the precision of the integrals varying according to the size of each MSOA region. The name of the variable we are interested in, in the liv_msoa, is "leb_est".

data(liv_msoa)
data(liv_lsoa)

## workaround for compatibility with different PROJ versions
st_crs(liv_msoa) <-
    st_crs(liv_msoa)$input

st_crs(liv_lsoa) <-
    st_crs(liv_lsoa)$input

## msoa from sf to spm
spm_msoa <-
    sf_to_spm(sf_obj = liv_msoa,
              n_pts  = 500,
              type   = "regular",
              by_polygon = FALSE,
              poly_ids = "msoa11cd",
              var_ids  = "leb_est")

Next, we assume the following model, at the grid level, drives the data¹ \[ Y(\mathbf{s}) = \mu + S(\mathbf{s}) + \varepsilon(\mathbf{s}), \] where \(S(\cdot) \sim {\rm GP}(0, {\rm C}(\cdot ; \, \boldsymbol{\theta}))\), and \(\varepsilon(\cdot) \overset{{\rm i.i.d.}}{\sim} {\rm N}(0, \tau)\), with \(S(\cdot) \perp \varepsilon(\cdot)\). The form of the covariance functions considered here are written as follows \({\rm C}(\cdot ; \, \boldsymbol{\theta} ) = \sigma^2 \rho(\cdot ; \, \boldsymbol{\theta})\). Where \(\rho(\cdot ; \, \boldsymbol{\theta})\) is a stationary and isotropic correlation function. To check the families of correlation functions available in our package, check this link. Another important detail is that, if we make \(\nu = \tau / \sigma^2\), the estimation process is slightly easier. The package allows for the two parametrizations.

For this problem we are going to consider the the Matérn covariance function. The model will be fitted with \(\nu = .5, 1.5, 2.5\), this the parameter controls the smoothness of the spatial process. Also, for simplicity, we are going to ignore \(\tau\). Next, we evaluate the AIC associated with three models. The function fit_spm fits the models and needs starting values for \(\phi\) (and \(\nu\) or \(\tau\), if you decide not to ignore the small scale variation), the covariance model to be used is informed as a string in the argument model, the available options are c("matern", "pexp", "gaussian", "spherical"). Notice that, when not inputting an initial value for \(\nu\) (or \(\tau\)), we are forcing this parameter to be 0. Also, the theta_st argument (which takes the initial values for the parameters), must be a named vector. If we input initial values for \(\mu\), \(\sigma^2\), \(\phi\), and \(\nu\) (or \(\tau\)), the likelihood is optimized numerically for all the parameters. If we omit \(\mu\) and \(\sigma^2\), then a profile likelihood approach is used to find numerically for the \(\hat{\phi}\) and, if wished, \(\hat{\nu}\). While, \(\hat{\mu}\) and \(\hat{\sigma}^2\) have closed form expressions (Diggle and Ribeiro 2007). Additionally, the commented part of the code below shows how to proceed to estimate \(\nu\), an equivalently \(\tau\), from the data. The argument apply_exp is a workaround for parameters that cannot be negative. The arguments opt_method and control are passed to the function optim in order to optimize the log likelihood associated with the data. opt_method controls the optimization algorithm to be used, while control inputs control arguments for such optimization algorithms, for more details see optim.

theta_st_msoa <- c("phi" = 1)

## 1) it is important to NAME the initial values for each parameter
## 2) to estimate "nu" from the data we only need to provide an initial value for such
## parameter
## 3) uncomment the code below to do so.
## 4) Note that it is possible to set the boundaries for the parameter space on
## which we want to optmize the likelihood.
## theta_st_msoa <- c("phi" = 1, "nu" = 1)
## fit_msoa1 <-
##     fit_spm(x = spm_msoa,
##             theta_st = theta_st_msoa,
##             model = "matern",
##             nu = .5,
##             lower = c(1e-16, 1e-16),
##             upper = c(Inf, 1),
##             opt_method = "L-BFGS-B",
##             control   = list(maxit = 500))

fit_msoa1 <-
    fit_spm(x = spm_msoa,
            theta_st = theta_st_msoa,
            model = "matern",
            nu = .5,
            apply_exp = TRUE,
            opt_method = "L-BFGS-B",
            control   = list(maxit = 500))

fit_msoa2 <-
    fit_spm(x = spm_msoa,
            theta_st = theta_st_msoa,
            model = "matern",
            nu = 1.5,
            apply_exp = TRUE,
            opt_method = "L-BFGS-B",
            control    = list(maxit = 500))

fit_msoa3 <-
    fit_spm(x = spm_msoa,
            theta_st = theta_st_msoa,
            model = "matern",
            nu = 2.5,
            apply_exp = TRUE,
            opt_method = "L-BFGS-B",
            control   = list(maxit = 500))

The AIC² associated with each model is given below. According to this criterion, the best model is the one with \(\nu = .5\).

c("m1" = AIC(fit_msoa1), "m2" = AIC(fit_msoa2), "m3" = AIC(fit_msoa3))
#>       m1       m2       m3 
#> 295.6540 297.0862 297.9339

We can retrieve the estimated parameters and the associated (1 - sig) % confidence intervals by using the function summary_spm_fit as follows

summary_spm_fit(fit_msoa1, sig = .05)
#> 
#>  optimization algorithm converged: yes 
#> 
#>     par   estimate        se               ci
#> 1    mu 75.9400123 0.8168155 [74.339; 77.541]
#> 2 sigsq 22.7210189 4.1142154 [14.657; 30.785]
#> 3   phi  0.8511316 0.2196581   [0.421; 1.282]

Finally, we have almost everything we need to perform predictions at the LSOA areas. The function predict_spm calculates the predictions associated to the spm_obj (which contains the results associated with a fitted model) into a sf object given as an input for the argument x. We need to specify whether we want to “aggregate” (i.e. average) the predicted surface over the polygons associated with the sf data. The number of points and the type of integration for prediction need to be set as well. In this case we are specifying a finer grid (3500 points) because the LSOA areas are smaller than the MSOA areas.

pred_lsoa <- predict_spm(x = liv_lsoa, spm_obj = fit_msoa1, id_var = "lsoa11cd")
#> Warning in st_centroid.sf(x = x[empty_polys, id_var]): st_centroid assumes
#> attributes are constant over geometries of x

The resulting object is a list with entries "mu_pred", "sig_pred", "pred_grid", "pred_agg". The first two positions correspond to the mean and covariance matrix at the predicted locations, respectively. "pred_grid" can be seen as the predicted surface over the study region, while "pred_agg" contains the (integrals) averages of such surface within each LSOA area. The chunk of code below plots the predicted life expectancy at the LSOA areas.

ggplot(data = pred_lsoa$pred_agg,
       aes(fill  = mu_pred)) +
    geom_sf(color = 1,
            lwd = .1) +
    scale_fill_viridis_c(option = "H") +
    guides(fill = "none") +
    theme_bw() +
    theme(axis.text  = element_blank(),
          axis.ticks = element_blank())

Next, we plot the standard errors associated with the predictions.

ggplot(data = pred_lsoa$pred_agg,
       aes(fill  = se_pred)) +
    geom_sf(color = 1,
            lwd = .1) +
    scale_fill_viridis_c(option = "H") +
    guides(fill = "none") +
    theme_bw() +
    theme(axis.text  = element_blank(),
          axis.ticks = element_blank())

Reference

Diggle, PJ, and Paulo Justiniano Ribeiro. 2007. “Model-Based Geostatistics (Springer Series in Statistics).” In, 51–56. Springer.

Johnson, Olatunji, Peter Diggle, and Emanuele Giorgi. 2020. “Dealing with Spatial Misalignment to Model the Relationship Between Deprivation and Life Expectancy: A Model-Based Geostatistical Approach.” International Journal of Health Geographics 19 (1): 1–13.

To check how the model at the grid level relates to the data at the area level check the Theory vignette ↩
BIC works for our models too.↩