The Basics of sfdep

library(sfdep)
library(dplyr)
library(ggplot2)

The goal of this vignette is to familiarize you with the basics of sfdep.

Intro: what is a spatial relationship?

sfdep provides users with a way to conduct “Exploratory Spatial Data Analysis”, or ESDA for short. ESDA differs from typical exploratory data analysis in that we are strictly exploring spatial relationships. As you might have guessed, ESDA evaluates whether the phenomena captured in your data are dependent upon space, or spatially autocorrelated. Much of ESDA is focused on “Local Indicators of Spatial Association”, or LISAs for short. LISAs are measures developed to identify whether an observed pattern is truly random or shaped by its relationship in space.

Much of the philosophy of LISAs and ESDA is captured in Tobler’s First Law of Geography:

“Everything is related to everything else. But near things are more related than distant things.” - Waldo R. Tobler, 1969

It’s tough to state this any more simply. Things that are next to each other tend to be more similar than things that are further away.

To assess whether near things are related and distant things less so, we typically use lattice data. A lattice is created when a landscape or region is divided into sub-areas. Most naturally, these types of data are represented as vector polygons.

Neighbors

To describe neighbors I’m going to steal extensively from my own post “Understanding Spatial Autocorrelation”.

If we assume that there is a spatial relationship in our data, we are taking on the belief that our data are not completely independent of each other. If nearer things are more related, then census tracts that are close to each other will have similar values.

In order to evaluate whether nearer things are related, we must know what observations are nearby. With polygon data we identify neighbors based on their contiguity. To be contiguous means to be connected or touching—think of the contiguous lower 48 states.

Contiguities

The two most common contiguities are based on the game of chess. Let’s take a simple chess board.

In chess, each piece moves in a different way. All pieces, with the exception of the knight, move either diagonally or horizontally and vertically. The most common contiguities are queen and rook contiguities. In chess, a queen can move diagonally as well as horizontally and vertically, whereas a rook can move only horizontally and vertically.

We extend this idea to polygons. Queen contiguity identifies neighbors as any polygons that touch, even at a single corner. Rook contiguity identifies neighbors only as polygons that share an edge. For most social science research, we only need to be concerned with queen contiguity.
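In sfdep, the choice between the two is made with the queen argument of st_contiguity(). A minimal sketch, using the guerry dataset bundled with the package:

# queen contiguity (the default): polygons that touch at any point are neighbors
nb_queen <- st_contiguity(guerry$geometry, queen = TRUE)

# rook contiguity: polygons must share an edge
nb_rook <- st_contiguity(guerry$geometry, queen = FALSE)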

While a chess board might make intuitive sense, geographies are really wonky in real life. Below is a map of the 47th observation in the guerry object and its queen contiguity neighbors.

You can see that any polygon that is touching, even at a corner, will be considered a neighbor to the polygon in question. This will be done for every polygon in our data set.
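For instance, with the nb_queen list from the sketch above, we can pull out the neighbor indices of the 47th observation:

# integer indices of the 47th polygon’s neighbors
nb_queen[[47]]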

Understanding the spatial weights

Once neighbors are identified, they can then be used to calculate spatial weights. The typical method of calculating the spatial weights is through row standardization (st_weights(nb, style = "W")). Each neighbor that touches our census tract will be assigned an equal weight. We do this by assigning each neighbor a value of 1 then dividing by the number of neighbors. If we have 5 neighboring census tracts, each of them will have a spatial weight of 0.2 (1 / 5 = 0.2).
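This is what st_weights() computes for every observation. A small sketch, reusing the nb_queen list from above (style = "W", row standardization, is the default):

# row-standardized weights: each neighbor gets 1 / (number of neighbors)
wt_queen <- st_weights(nb_queen, style = "W")

# the weights for the 47th observation sum to 1
wt_queen[[47]]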

Going back to the chess board example, we can take the position d4 and look at its queen contiguities. There are 8 squares that immediately touch it. Each one of these squares is considered a neighbor and given a value of 1. Then each value is divided by the total number of neighbors, 8.

Very simply, it looks like the following:

(d4_nbs <- rep(1, 8))
#> [1] 1 1 1 1 1 1 1 1

d4_nbs / length(d4_nbs)
#> [1] 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125

Creating Neighbors and Weights

sfdep utilizes list objects for both neighbors and weights: a neighbors list and a weights list.

To identify contiguity-based neighbors, we use st_contiguity() on the sf geometry column. And to calculate the weights from the neighbors list, we use st_weights() on the resultant neighbors list. By convention these are typically called nb and wt.

These lists can be created line by line or within a pipe. The most common use case is likely via a dplyr pipeline.
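Line by line, that looks like the following (a minimal sketch):

# identify queen contiguity neighbors from the geometry column
nb <- st_contiguity(guerry$geometry)

# compute row-standardized weights from the neighbors list
wt <- st_weights(nb)

And within a dplyr pipeline: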

guerry_nb <- guerry %>% 
  mutate(nb = st_contiguity(geometry),
         wt = st_weights(nb),
         .before = 1) # to put them in the front

guerry_nb
#> Simple feature collection with 85 features and 28 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 47680 ymin: 1703258 xmax: 1031401 ymax: 2677441
#> CRS:           NA
#> # A tibble: 85 × 29
#>    nb        wt     code_dept count ave_i…¹  dept region depar…² crime…³ crime…⁴
#>  * <nb>      <list> <fct>     <dbl>   <dbl> <int> <fct>  <fct>     <int>   <int>
#>  1 <int [4]> <dbl>  01            1      49     1 E      Ain       28870   15890
#>  2 <int [6]> <dbl>  02            1     812     2 N      Aisne     26226    5521
#>  3 <int [6]> <dbl>  03            1    1418     3 C      Allier    26747    7925
#>  4 <int [4]> <dbl>  04            1    1603     4 E      Basses…   12935    7289
#>  5 <int [3]> <dbl>  05            1    1802     5 E      Hautes…   17488    8174
#>  6 <int [7]> <dbl>  07            1    2249     7 S      Ardeche    9474   10263
#>  7 <int [3]> <dbl>  08            1   35395     8 N      Ardenn…   35203    8847
#>  8 <int [3]> <dbl>  09            1    2526     9 S      Ariege     6173    9597
#>  9 <int [5]> <dbl>  10            1   34410    10 E      Aube      19602    4086
#> 10 <int [5]> <dbl>  11            1    2807    11 S      Aude      15647   10431
#> # … with 75 more rows, 19 more variables: literacy <int>, donations <int>,
#> #   infants <int>, suicides <int>, main_city <ord>, wealth <int>,
#> #   commerce <int>, clergy <int>, crime_parents <int>, infanticide <int>,
#> #   donation_clergy <int>, lottery <int>, desertion <int>, instruction <int>,
#> #   prostitutes <int>, distance <dbl>, area <int>, pop1831 <dbl>,
#> #   geometry <MULTIPOLYGON>, and abbreviated variable names ¹​ave_id_geo,
#> #   ²​department, ³​crime_pers, ⁴​crime_prop
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Calculating LISAs

To calculate LISAs we typically provide a numeric vector (or vectors), a neighbors list, and a weights list, and often the argument nsim to determine the number of simulations to run. Most LISAs return a data frame with the same number of rows as the input data frame. The resultant data frame can be unnested, or its columns hoisted, for ease of analysis.

For example, to calculate the Local Moran we use the function local_moran():

lisa <- guerry_nb %>% 
  mutate(local_moran = local_moran(crime_pers, nb, wt, nsim = 199),
         .before = 1)

lisa
#> Simple feature collection with 85 features and 29 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 47680 ymin: 1703258 xmax: 1031401 ymax: 2677441
#> CRS:           NA
#> # A tibble: 85 × 30
#>    local_moran$ii nb    wt    code_…¹ count ave_i…²  dept region depar…³ crime…⁴
#>  *          <dbl> <nb>  <lis> <fct>   <dbl>   <dbl> <int> <fct>  <fct>     <int>
#>  1         0.522  <int> <dbl> 01          1      49     1 E      Ain       28870
#>  2         0.828  <int> <dbl> 02          1     812     2 N      Aisne     26226
#>  3         0.804  <int> <dbl> 03          1    1418     3 C      Allier    26747
#>  4         0.742  <int> <dbl> 04          1    1603     4 E      Basses…   12935
#>  5         0.231  <int> <dbl> 05          1    1802     5 E      Hautes…   17488
#>  6         0.839  <int> <dbl> 07          1    2249     7 S      Ardeche    9474
#>  7         0.623  <int> <dbl> 08          1   35395     8 N      Ardenn…   35203
#>  8         1.65   <int> <dbl> 09          1    2526     9 S      Ariege     6173
#>  9        -0.0198 <int> <dbl> 10          1   34410    10 E      Aube      19602
#> 10         0.695  <int> <dbl> 11          1    2807    11 S      Aude      15647
#> # … with 75 more rows, 31 more variables: local_moran$eii <dbl>, $var_ii <dbl>,
#> #   $z_ii <dbl>, $p_ii <dbl>, $p_ii_sim <dbl>, $p_folded_sim <dbl>,
#> #   $skewness <dbl>, $kurtosis <dbl>, $mean <fct>, $median <fct>, $pysal <fct>,
#> #   crime_prop <int>, literacy <int>, donations <int>, infants <int>,
#> #   suicides <int>, main_city <ord>, wealth <int>, commerce <int>,
#> #   clergy <int>, crime_parents <int>, infanticide <int>,
#> #   donation_clergy <int>, lottery <int>, desertion <int>, instruction <int>, …
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

local_moran() returns a data frame column, so to work with its individual measures we need to unnest it.

lisa %>% 
  tidyr::unnest(local_moran)
#> Simple feature collection with 85 features and 40 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 47680 ymin: 1703258 xmax: 1031401 ymax: 2677441
#> CRS:           NA
#> # A tibble: 85 × 41
#>         ii       eii   var_ii   z_ii    p_ii p_ii_sim p_folde…¹ skewness kurto…²
#>      <dbl>     <dbl>    <dbl>  <dbl>   <dbl>    <dbl>     <dbl>    <dbl>   <dbl>
#>  1  0.522  -0.0749   0.366     0.987 0.324       0.38     0.19   0.0577   -0.419
#>  2  0.828   0.0279   0.122     2.29  0.0220      0.04     0.02  -0.150     0.236
#>  3  0.804  -0.0458   0.131     2.34  0.0190      0.02     0.01  -0.00508  -0.452
#>  4  0.742  -0.00421  0.193     1.70  0.0896      0.07     0.035 -0.0655   -0.203
#>  5  0.231  -0.000129 0.0346    1.24  0.214       0.26     0.13  -0.0132   -0.671
#>  6  0.839  -0.0636   0.259     1.77  0.0759      0.07     0.035 -0.288    -0.424
#>  7  0.623   0.00772  1.41      0.518 0.605       0.6      0.3    0.141     0.411
#>  8  1.65   -0.0125   1.23      1.50  0.135       0.09     0.045 -0.463     0.101
#>  9 -0.0198  0.000226 0.000471 -0.921 0.357       0.39     0.195 -0.258    -0.209
#> 10  0.695  -0.00809  0.0700    2.66  0.00787     0.02     0.01   0.0281   -0.257
#> # … with 75 more rows, 32 more variables: mean <fct>, median <fct>,
#> #   pysal <fct>, nb <nb>, wt <list>, code_dept <fct>, count <dbl>,
#> #   ave_id_geo <dbl>, dept <int>, region <fct>, department <fct>,
#> #   crime_pers <int>, crime_prop <int>, literacy <int>, donations <int>,
#> #   infants <int>, suicides <int>, main_city <ord>, wealth <int>,
#> #   commerce <int>, clergy <int>, crime_parents <int>, infanticide <int>,
#> #   donation_clergy <int>, lottery <int>, desertion <int>, instruction <int>, …
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

This can then be used for visualization or further analysis.
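For example, we can map the local Moran statistic with ggplot2. A minimal sketch, using the ii column from the unnested results:

lisa %>% 
  tidyr::unnest(local_moran) %>% 
  ggplot(aes(fill = ii)) +
  geom_sf() +
  theme_void()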

Additionally, some LISAs can take any number of inputs, e.g. 3 or more numeric variables. In that case, we provide the variables as a list. Take for example the Local C statistic.

guerry_nb %>% 
  mutate(local_c = local_c_perm(list(crime_pers, wealth), nb, wt), 
         .before = 1) %>% 
  tidyr::unnest(local_c)
#> Simple feature collection with 85 features and 38 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 47680 ymin: 1703258 xmax: 1031401 ymax: 2677441
#> CRS:           NA
#> # A tibble: 85 × 39
#>       ci cluster   e_ci var_ci   z_ci   p_ci p_ci_sim p_folde…¹ skewn…² kurtosis
#>    <dbl> <fct>    <dbl>  <dbl>  <dbl>  <dbl>    <dbl>     <dbl>   <dbl>    <dbl>
#>  1 1.53  Positive  2.52  0.708 -1.17  0.243     0.244     0.122   0.212 -0.153  
#>  2 0.500 Positive  1.75  0.384 -2.02  0.0439    0.016     0.008   0.435 -0.0433 
#>  3 0.642 Positive  1.64  0.226 -2.11  0.0348    0.024     0.012   0.177 -0.186  
#>  4 0.324 Positive  2.33  0.850 -2.17  0.0298    0.004     0.002   0.331 -0.183  
#>  5 0.298 Positive  2.28  1.02  -1.96  0.0495    0.02      0.01    0.214 -0.337  
#>  6 1.60  Positive  3.37  0.733 -2.07  0.0388    0.02      0.01    0.131 -0.376  
#>  7 2.04  Positive  3.29  1.63  -0.973 0.330     0.372     0.186   0.338 -0.271  
#>  8 2.20  Positive  3.41  1.69  -0.930 0.352     0.384     0.192   0.272 -0.167  
#>  9 0.507 Positive  1.74  0.441 -1.85  0.0640    0.032     0.016   0.406 -0.00872
#> 10 1.46  Positive  1.76  0.415 -0.468 0.640     0.668     0.334   0.400  0.137  
#> # … with 75 more rows, 29 more variables: nb <nb>, wt <list>, code_dept <fct>,
#> #   count <dbl>, ave_id_geo <dbl>, dept <int>, region <fct>, department <fct>,
#> #   crime_pers <int>, crime_prop <int>, literacy <int>, donations <int>,
#> #   infants <int>, suicides <int>, main_city <ord>, wealth <int>,
#> #   commerce <int>, clergy <int>, crime_parents <int>, infanticide <int>,
#> #   donation_clergy <int>, lottery <int>, desertion <int>, instruction <int>,
#> #   prostitutes <int>, distance <dbl>, area <int>, pop1831 <dbl>, …
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names