library(sfdep)
library(dplyr)
library(ggplot2)
The goal of this vignette is to familiarize you with the basics of sfdep:
sfdep provides users with a way to conduct “Exploratory Spatial Data Analysis”, or ESDA for short. ESDA differs from typical exploratory data analysis in that we are strictly exploring spatial relationships. As you might have guessed, ESDA evaluates whether the phenomena captured in your data are dependent upon space–or are spatially auto-correlated. Much of ESDA is focused on “Local Indicators of Spatial Association”, LISAs for short. LISAs are measures that are developed to identify whether some observed pattern is truly random or impacted by its relationship in space.
Much of the philosophy of LISAs and ESDA are captured in Tobler’s First Law of Geography
“Everything is related to everything else. But near things are more related than distant things.” - Waldo R. Tobler, 1969
It’s tough to state this any more simply. Things that are next to each other tend to be more similar than things that are further away.
To assess whether near things are related and further things less so, we typically lattice data. A lattice is created when a landscape or region is divided into sub-areas. Most naturally, these types of data are represented as vector polygons.
To describe neighbors I’m going to steal extensively from my own post “Understanding Spatial Autocorrelation”.
If we assume that there is a spatial relationship in our data, we are taking on the belief that our data are not completely independent of each other. If nearer things are more related, then census tracts that are close to each other will have similar values.
In order to evaluate whether nearer things are related, we must know what observations are nearby. With polygon data we identify neighbors based on their contiguity. To be contiguous means to be connected or touching—think of the contiguous lower 48 states.
The two most common contiguities are based on the game of chess. Let’s take a simple chess board.
In chess each piece can move in a different way. All pieces, with the exception of the knight, move either diagonally or horizontally and vertically. The most common contiguities are queen and rook contiguities. In chess, a queen can move diagonally and horizontal and vertically whereas a rook can only move horizontal and vertically.
We extend this idea to polygons. Queen contiguities identify neighbors based on any polygon that is touching. With rook contiguities, we identify neighbors based on polygons that touch on the side. For most social science research, we only need to be concerned with queen contiguities.
While a chess board might make intuitive sense, geographies are
really wonky in real life. Below is map of the 47th observation in the
guerry
object and it’s queen contiguity neighbors.
You can see that any polygon that is touching, even at a corner, will be considered a neighbor to the point in question. This will be done for every polygon in our data set.
Once neighbors are identified, they can then be used to calculate
spatial weights. The typical method of calculating the
spatial weights is through row standardization
(st_weights(nb, style = "W")
). Each neighbor that touches
our census tract will be assigned an equal weight. We do this by
assigning each neighbor a value of 1 then dividing by the number of
neighbors. If we have 5 neighboring census tracts, each of them will
have a spatial weight of 0.2 (1 / 5 = 0.2).
Going back to the chess board example, we can take the position d4 and look at the queen contiguities. There are 8 squares that immediately touch the square. Each one of these squares is considered a neighbor and given a value of 1. Then each square is divided by the total number or neighbors, 8.
Very simply it looks like the following
<- rep(1, 8))
(d4_nbs #> [1] 1 1 1 1 1 1 1 1
/ length(d4_nbs)
d4_nbs #> [1] 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125
sfdep utilizes list objects for both neighbors and weights. The neighbors and weights lists.
To identify contiguity-based neighbors, we use
st_contiguity()
on the sf geometry column. And to calculate
the weights from the neighbors list, we use st_weights()
on
the resultant neighbors list. By convention these are typically called
nb
and wt
.
These lists can be created line by line or within a pipe. The most common usecase is likely via a dplyr pipeline.
<- guerry %>%
guerry_nb mutate(nb = st_contiguity(geometry),
wt = st_weights(nb),
.before = 1) # to put them in the front
guerry_nb#> Simple feature collection with 85 features and 28 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: 47680 ymin: 1703258 xmax: 1031401 ymax: 2677441
#> CRS: NA
#> # A tibble: 85 × 29
#> nb wt code_dept count ave_i…¹ dept region depar…² crime…³ crime…⁴
#> * <nb> <list> <fct> <dbl> <dbl> <int> <fct> <fct> <int> <int>
#> 1 <int [4]> <dbl> 01 1 49 1 E Ain 28870 15890
#> 2 <int [6]> <dbl> 02 1 812 2 N Aisne 26226 5521
#> 3 <int [6]> <dbl> 03 1 1418 3 C Allier 26747 7925
#> 4 <int [4]> <dbl> 04 1 1603 4 E Basses… 12935 7289
#> 5 <int [3]> <dbl> 05 1 1802 5 E Hautes… 17488 8174
#> 6 <int [7]> <dbl> 07 1 2249 7 S Ardeche 9474 10263
#> 7 <int [3]> <dbl> 08 1 35395 8 N Ardenn… 35203 8847
#> 8 <int [3]> <dbl> 09 1 2526 9 S Ariege 6173 9597
#> 9 <int [5]> <dbl> 10 1 34410 10 E Aube 19602 4086
#> 10 <int [5]> <dbl> 11 1 2807 11 S Aude 15647 10431
#> # … with 75 more rows, 19 more variables: literacy <int>, donations <int>,
#> # infants <int>, suicides <int>, main_city <ord>, wealth <int>,
#> # commerce <int>, clergy <int>, crime_parents <int>, infanticide <int>,
#> # donation_clergy <int>, lottery <int>, desertion <int>, instruction <int>,
#> # prostitutes <int>, distance <dbl>, area <int>, pop1831 <dbl>,
#> # geometry <MULTIPOLYGON>, and abbreviated variable names ¹ave_id_geo,
#> # ²department, ³crime_pers, ⁴crime_prop
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
To calculate LISAs we typically will provide a numeric object(s), a
neighbor list, and a weights list–and often the argument
nsim
to determine the number of simulations to run. Most
LISAs return a data frame of the same number of rows as the input
dataframe. The resultant data frame can be unnested, or columns hoisted
for ease of analysis.
For example to calculate the Local Moran we use the function
local_moran()
<- guerry_nb %>%
lisa mutate(local_moran = local_moran(crime_pers, nb, wt, nsim = 199),
.before = 1)
lisa#> Simple feature collection with 85 features and 29 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: 47680 ymin: 1703258 xmax: 1031401 ymax: 2677441
#> CRS: NA
#> # A tibble: 85 × 30
#> local_moran$ii nb wt code_…¹ count ave_i…² dept region depar…³ crime…⁴
#> * <dbl> <nb> <lis> <fct> <dbl> <dbl> <int> <fct> <fct> <int>
#> 1 0.522 <int> <dbl> 01 1 49 1 E Ain 28870
#> 2 0.828 <int> <dbl> 02 1 812 2 N Aisne 26226
#> 3 0.804 <int> <dbl> 03 1 1418 3 C Allier 26747
#> 4 0.742 <int> <dbl> 04 1 1603 4 E Basses… 12935
#> 5 0.231 <int> <dbl> 05 1 1802 5 E Hautes… 17488
#> 6 0.839 <int> <dbl> 07 1 2249 7 S Ardeche 9474
#> 7 0.623 <int> <dbl> 08 1 35395 8 N Ardenn… 35203
#> 8 1.65 <int> <dbl> 09 1 2526 9 S Ariege 6173
#> 9 -0.0198 <int> <dbl> 10 1 34410 10 E Aube 19602
#> 10 0.695 <int> <dbl> 11 1 2807 11 S Aude 15647
#> # … with 75 more rows, 31 more variables: local_moran$eii <dbl>, $var_ii <dbl>,
#> # $z_ii <dbl>, $p_ii <dbl>, $p_ii_sim <dbl>, $p_folded_sim <dbl>,
#> # $skewness <dbl>, $kurtosis <dbl>, $mean <fct>, $median <fct>, $pysal <fct>,
#> # crime_prop <int>, literacy <int>, donations <int>, infants <int>,
#> # suicides <int>, main_city <ord>, wealth <int>, commerce <int>,
#> # clergy <int>, crime_parents <int>, infanticide <int>,
#> # donation_clergy <int>, lottery <int>, desertion <int>, instruction <int>, …
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
Now that we have a data frame, we need to unnest it.
%>%
lisa ::unnest(local_moran)
tidyr#> Simple feature collection with 85 features and 40 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: 47680 ymin: 1703258 xmax: 1031401 ymax: 2677441
#> CRS: NA
#> # A tibble: 85 × 41
#> ii eii var_ii z_ii p_ii p_ii_sim p_folde…¹ skewness kurto…²
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.522 -0.0749 0.366 0.987 0.324 0.38 0.19 0.0577 -0.419
#> 2 0.828 0.0279 0.122 2.29 0.0220 0.04 0.02 -0.150 0.236
#> 3 0.804 -0.0458 0.131 2.34 0.0190 0.02 0.01 -0.00508 -0.452
#> 4 0.742 -0.00421 0.193 1.70 0.0896 0.07 0.035 -0.0655 -0.203
#> 5 0.231 -0.000129 0.0346 1.24 0.214 0.26 0.13 -0.0132 -0.671
#> 6 0.839 -0.0636 0.259 1.77 0.0759 0.07 0.035 -0.288 -0.424
#> 7 0.623 0.00772 1.41 0.518 0.605 0.6 0.3 0.141 0.411
#> 8 1.65 -0.0125 1.23 1.50 0.135 0.09 0.045 -0.463 0.101
#> 9 -0.0198 0.000226 0.000471 -0.921 0.357 0.39 0.195 -0.258 -0.209
#> 10 0.695 -0.00809 0.0700 2.66 0.00787 0.02 0.01 0.0281 -0.257
#> # … with 75 more rows, 32 more variables: mean <fct>, median <fct>,
#> # pysal <fct>, nb <nb>, wt <list>, code_dept <fct>, count <dbl>,
#> # ave_id_geo <dbl>, dept <int>, region <fct>, department <fct>,
#> # crime_pers <int>, crime_prop <int>, literacy <int>, donations <int>,
#> # infants <int>, suicides <int>, main_city <ord>, wealth <int>,
#> # commerce <int>, clergy <int>, crime_parents <int>, infanticide <int>,
#> # donation_clergy <int>, lottery <int>, desertion <int>, instruction <int>, …
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
This can then be used for visualization or further analysis.
Additionally, for other LISAs that can take any number of inputs, e.g. 3 or more numeric variables, we provide this as a list. Take for example the Local C statistic.
%>%
guerry_nb mutate(local_c = local_c_perm(list(crime_pers, wealth), nb, wt),
.before = 1) %>%
::unnest(local_c)
tidyr#> Simple feature collection with 85 features and 38 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: 47680 ymin: 1703258 xmax: 1031401 ymax: 2677441
#> CRS: NA
#> # A tibble: 85 × 39
#> ci cluster e_ci var_ci z_ci p_ci p_ci_sim p_folde…¹ skewn…² kurtosis
#> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1.53 Positive 2.52 0.708 -1.17 0.243 0.244 0.122 0.212 -0.153
#> 2 0.500 Positive 1.75 0.384 -2.02 0.0439 0.016 0.008 0.435 -0.0433
#> 3 0.642 Positive 1.64 0.226 -2.11 0.0348 0.024 0.012 0.177 -0.186
#> 4 0.324 Positive 2.33 0.850 -2.17 0.0298 0.004 0.002 0.331 -0.183
#> 5 0.298 Positive 2.28 1.02 -1.96 0.0495 0.02 0.01 0.214 -0.337
#> 6 1.60 Positive 3.37 0.733 -2.07 0.0388 0.02 0.01 0.131 -0.376
#> 7 2.04 Positive 3.29 1.63 -0.973 0.330 0.372 0.186 0.338 -0.271
#> 8 2.20 Positive 3.41 1.69 -0.930 0.352 0.384 0.192 0.272 -0.167
#> 9 0.507 Positive 1.74 0.441 -1.85 0.0640 0.032 0.016 0.406 -0.00872
#> 10 1.46 Positive 1.76 0.415 -0.468 0.640 0.668 0.334 0.400 0.137
#> # … with 75 more rows, 29 more variables: nb <nb>, wt <list>, code_dept <fct>,
#> # count <dbl>, ave_id_geo <dbl>, dept <int>, region <fct>, department <fct>,
#> # crime_pers <int>, crime_prop <int>, literacy <int>, donations <int>,
#> # infants <int>, suicides <int>, main_city <ord>, wealth <int>,
#> # commerce <int>, clergy <int>, crime_parents <int>, infanticide <int>,
#> # donation_clergy <int>, lottery <int>, desertion <int>, instruction <int>,
#> # prostitutes <int>, distance <dbl>, area <int>, pop1831 <dbl>, …
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names