The spnaf package is developed for calculating spatial network autocorrelation for flow data. Functions in the package are designed specifically to evaluate how networks are spatially clustered, in the form of \(G_{ij}\) statistic which is presented in the paper written by Berglund and Karlström.
The package has a dataset called CA which stands for California, US. This dataset contains migration amounts among CA counties in 2019. The data consists of origins and destinations of each residential flow.
dim(CA)
#> [1] 2580 12
head(CA)
#> State.Code.of.Geography.A FIPS.County.Code.of.Geography.A
#> 1 006 001
#> 2 006 001
#> 3 006 001
#> 4 006 001
#> 5 006 001
#> 6 006 001
#> State.U.S..Island.Area.Foreign.Region.Code.of.Geography.B
#> 1 006
#> 2 006
#> 3 006
#> 4 006
#> 5 006
#> 6 006
#> FIPS.County.Code.of.Geography.B State.Name.of.Geography.A
#> 1 005 California
#> 2 007 California
#> 3 009 California
#> 4 011 California
#> 5 013 California
#> 6 015 California
#> County.Name.of.Geography.A
#> 1 Alameda County
#> 2 Alameda County
#> 3 Alameda County
#> 4 Alameda County
#> 5 Alameda County
#> 6 Alameda County
#> State.U.S..Island.Area.Foreign.Region.of.Geography.B
#> 1 California
#> 2 California
#> 3 California
#> 4 California
#> 5 California
#> 6 California
#> County.Name.of.Geography.B Flow.from.Geography.B.to.Geography.A
#> 1 Amador County 0
#> 2 Butte County 304
#> 3 Calaveras County 49
#> 4 Colusa County 0
#> 5 Contra Costa County 8761
#> 6 Del Norte County 1
#> Counterflow.from.Geography.A.to.Geography.B
#> 1 97
#> 2 1259
#> 3 349
#> 4 29
#> 5 18007
#> 6 95
#> Net.Migration.from.Geography.B.to.Geography.A
#> 1 -97
#> 2 -955
#> 3 -300
#> 4 -29
#> 5 -9246
#> 6 -94
#> Gross.Migration.between.Geography.A.and.Geography.B
#> 1 97
#> 2 1563
#> 3 398
#> 4 29
#> 5 26768
#> 6 96
The package also has a sf object called CA_polygon which is a sf class object that represents boundaries of CA counties. It has id column and geometry column and can be plotted by attaching the sf package. The polygon can be joined with CA since it has id column that matches County code of CA. You can learn more about how to deal with spatial objects at https://r-spatial.github.io/sf/.
library(sf)
plot(CA_polygon, col = 'white', main = 'CA polygon')
spnaf package aims to measure spatial density of networks, which have origins (starting point) and destinations (ending point). Main function of spnaf is called Gij.polygon and the first main input of the function is df which is OD data in a data.frame form that must contain “oid”, “did”, and “n” (please refer to the help document) like CA above. The second important input is shape which is corresponding polygon object in sf class. The function also inherited two parameters from spdep such as queen, snap. The parameter method is one of c(“t”, “o”, “d”) which stand for total, origins only, and destinations only respectively (Please check this paper to get more information about the method). The last parameter n is used for bootstrapping permutation of resampling the individual statistic n times to generate a non-parametric distribution, since there would be a violation of the assumption of normality when one tries to calculate a spatial statistic with polygons(see how authors told about it in this paper). The process should be done to ensure a statistical significance of the statistic.
args(Gij.polygon)
#> function (df, shape, queen = TRUE, snap = 1, method = "t", R = 1000)
#> NULL
# Data manipulation
<- spnaf::CA
CA <- cbind(CA$FIPS.County.Code.of.Geography.B, CA$FIPS.County.Code.of.Geography.A)
OD <- cbind(OD, CA$Flow.from.Geography.B.to.Geography.A)
OD <- data.frame(OD)
OD names(OD) <- c("oid", "did", "n")
$n <- as.numeric(OD$n)
OD<- OD[order(OD[,1], OD[,2]),]
OD head(OD) # check the input df's format
#> oid did n
#> 63 001 005 97
#> 108 001 007 1259
#> 160 001 009 349
#> 201 001 011 29
#> 229 001 013 18007
#> 280 001 015 95
# Load sf polygon
<- spnaf::CA_polygon
CA_polygon head(CA_polygon) # it has geometry column
#> Simple feature collection with 6 features and 1 field
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -124.4096 ymin: 35.90691 xmax: -118.3606 ymax: 41.46584
#> Geodetic CRS: NAD83
#> id geometry
#> 1 001 MULTIPOLYGON (((-122.3423 3...
#> 2 003 MULTIPOLYGON (((-120.0724 3...
#> 3 005 MULTIPOLYGON (((-121.0274 3...
#> 4 013 MULTIPOLYGON (((-122.4298 3...
#> 5 019 MULTIPOLYGON (((-120.9094 3...
#> 6 023 MULTIPOLYGON (((-124.4086 4...
# Execution of Gij.polygon with data above and given parameters
<- Gij.polygon(df = OD, shape = CA_polygon, queen = TRUE, snap = 1,
result method = 't', R = 1000)
#> (1) Creating Spatial Weights... Done !
#> note: Total 3306 network combinations are ready to be analyzed
#> note: which is calculated by the input polygon data
#> (2) Calculating Gij ... Done!
#> (3) Do bootstrapping ... Done !
#> (4) Generating result in lines ... Done !
The metric, an extended statistic of Getis and Ord (1992), \(G_{i}^{*}\), has similar intuition of hotspot analysis with static data: a high and significant value in a flow indicates spatial clustering of flows with high values. it can be interpreted as Z-value suitable for conducting statistical tests as the metric inherited the characteristics of \(G_{i}^{*}\). If one conducted bootstrapping for 1,000 times like above, those with a value greater than the 50th largest value of the distribution (i.e., at the significance level of 0.05) can be defined as positive clusters.
# positive clusters at the significance level of 0.05
head(result[[1]][result[[1]]$pval < 0.05,])
#> oid did n Gij pval
#> 2 001 005 97 1.568175 0.048
#> 6 001 013 18007 7.456884 0.005
#> 18 001 037 3957 2.754069 0.041
#> 20 001 041 959 2.281708 0.040
#> 23 001 047 586 2.548161 0.034
#> 29 001 059 953 1.960098 0.038
# positive clusters at the significance level of 0.05 in lines class
head(result[[2]][result[[2]]$pval < 0.05,])
#> Simple feature collection with 6 features and 5 fields
#> Geometry type: LINESTRING
#> Dimension: XY
#> Bounding box: xmin: -122.7238 ymin: 33.7029 xmax: -117.7608 ymax: 38.44661
#> Geodetic CRS: NAD83
#> oid did n Gij pval sfc
#> 2 001 005 97 1.568175 0.048 LINESTRING (-121.8887 37.64...
#> 6 001 013 18007 7.456884 0.005 LINESTRING (-121.8887 37.64...
#> 18 001 037 3957 2.754069 0.041 LINESTRING (-121.8887 37.64...
#> 20 001 041 959 2.281708 0.040 LINESTRING (-121.8887 37.64...
#> 23 001 047 586 2.548161 0.034 LINESTRING (-121.8887 37.64...
#> 29 001 059 953 1.960098 0.038 LINESTRING (-121.8887 37.64...
library(tmap)
# plot all flows with the polygon (left)
tm_shape(CA_polygon) +
tm_polygons()+
tm_shape(result[[2]]) +
tm_lines()
# plot significant flows only with the polygon (right)
tm_shape(CA_polygon) +
tm_polygons()+
tm_shape(result[[2]][result[[2]]$pval < 0.05,]) +
tm_lines(col='pval')