3. Application: matching

One common type of task with spatio-temporal data is to match nearby sites. For example, we may want to verify the location of an old list of stations with current stations, or we may want to match the data from different data sources. Some of these matches only concern the spatial dimension, while others require temporal agreement.

This vignette introduces how to spatially and spatio-temporally match sites with the cubble structure with two examples. The first example pairs traditional weather stations with nearby automated stations in New South Wales, Australia. This exercise only concerns the matching based on spherical distance between stations. The next example pairs the river level recorded by the river gauges with the precipitation recorded by the nearby weather station in Victoria, Australia.

Spatial matching

Again we will start with prcp_aus to look at precipitation and focus on New South Wales stations since that is where most automated stations are implemented. The figure below shows the location of traditional and automated weather station on the map:

nsw_map <- ozmaps::abs_ste %>%  
  filter(NAME == "New South Wales") 

nsw <- prcp_aus %>%    
  # subset for New South Wales stations
  filter(between(stringr::str_sub(id, 7, 8), 46, 75)) %>% 
  mutate(automated = stringr::str_detect(name, "aws")) %>%  
  face_temporal(ts) %>% 
  filter(lubridate::month(date) == 1,
         lubridate::year(date) == 2020) %>%  
  face_spatial() %>%  
  filter(!any(is.na(ts$prcp)))
 
ggplot() +
  geom_sf(data = nsw_map, color = "grey", linetype = "dotted") +
  geom_point(data = nsw, aes(x = long, y = lat, color = automated)) +
  theme_bw() +
  theme(legend.position = "bottom") +
  labs(x = "Longitude", y = "Latitude") + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("New Sourth Wales") + 
  coord_sf(xlim = c(141, 154))

In the map we can see some traditional and automated weather stations are close to each other. This can be useful information to cross validate recordings from different types of weather stations.

In cubble, match_sites() houses match_spatial() and match_temporal(). For a spatial-only matching, you can use match_sites(temporal_matching = FALSE) or simply match_spatial().

Any matching requires two datasets in the cubble and we call them major and minor. Major and minor dataset differs from how distance is calculated. Spatial matching calculates the spherical distance using the Vincenty formula and this distance is calculated from each site in the major dataset is to every site in the minor dataset.

Once the distance is calculated, three arguments are available to refine the matching results:

The order that these three arguments applied will slightly affect the results in cubble. spatial_n_keep, default to 1, is first applied to keep n site(s) for each major site, spaital_dist_max, default to 10, is then applied to filter out the pairs with distance larger than this maximum distance. spatial_single_match is lastly applied to resolve the scenario where site a (minor) is the closest match for both site A and B (major) with distance 5km and 8km. If spatial_single_match = TRUE, a will only be matched to the major site with the smaller distance, that is, site A here.

Let’s get back to the weather stations.

We first construct the major site auto and minor non_auto by filtering on whether stations are automated or not. Here we would like to find each station in auto a match in non_auto. Hence auto is the major dataset and non_auto is the minor in the match_sites():

auto <- nsw %>%  filter(automated)
non_auto <- nsw %>%  filter(!automated)

matched <- match_sites(auto, non_auto, temporal_matching = FALSE) 

The result from the pairing is also a cubble with two additional columns: dist as the distance between the pair and group as the grouping index:

matched 
#> # cubble:   id [18]: nested form
#> # bbox:     [145.79, -36.11, 151.67, -30.52]
#> # temporal: date [date], prcp [dbl]
#>    id            lat  long   elev name     wmo_id automated ts        dist group
#>    <chr>       <dbl> <dbl>  <dbl> <chr>     <dbl> <lgl>     <list>   <dbl> <int>
#>  1 ASN00050137 -33.1  147.  193.  condobo…  95708 TRUE      <tibble>  1.41     1
#>  2 ASN00050052 -33.1  147.  195   condobo…  94707 FALSE     <tibble>  1.41     1
#>  3 ASN00056238 -30.5  152. 1079   armidal…  95773 TRUE      <tibble>  5.16     2
#>  4 ASN00056037 -30.5  152.  987   armidal…  94773 FALSE     <tibble>  5.16     2
#>  5 ASN00064017 -31.3  149.  643   coonaba…  95728 TRUE      <tibble>  6.58     3
#>  6 ASN00064008 -31.3  149.  505   coonaba…  94728 FALSE     <tibble>  6.58     3
#>  7 ASN00048237 -31.5  146.  218   cobar a…  94710 TRUE      <tibble>  6.86     4
#>  8 ASN00048027 -31.5  146.  260   cobar mo  94711 FALSE     <tibble>  6.86     4
#>  9 ASN00066194 -33.9  151.    3   canterb…  94766 TRUE      <tibble>  7.14     5
#> 10 ASN00066037 -33.9  151.    6   sydney …  94767 FALSE     <tibble>  7.14     5
#> 11 ASN00068192 -34.0  151.   73.9 camden …  94755 TRUE      <tibble>  8.17     6
#> 12 ASN00068257 -34.1  151.  112   campbel…  94757 FALSE     <tibble>  8.17     6
#> 13 ASN00072160 -36.1  147.  164.  albury …  95896 TRUE      <tibble>  8.33     7
#> 14 ASN00072023 -36.1  147.  184   hume re…  94901 FALSE     <tibble>  8.33     7
#> 15 ASN00067113 -33.7  151.   24.7 penrith…  94763 TRUE      <tibble>  8.77     8
#> 16 ASN00063077 -33.7  151.  320   springw…  95744 FALSE     <tibble>  8.77     8
#> 17 ASN00063291 -33.4  150.  744.  bathurs…  94729 TRUE      <tibble>  9.30     9
#> 18 ASN00063005 -33.4  150.  713   bathurs…  94730 FALSE     <tibble>  9.30     9

Then we can create visualisation to see where these pairs are in the map:

ggplot() + 
  geom_sf(data = nsw_map) + 
  geom_point(data = matched, 
             aes(x = long, y = lat, color = automated)) + 
  ggrepel::geom_label_repel(
    data = matched %>%  filter(automated),
    aes(x = long, y = lat, label = group)) + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("New South Wales") + 
  theme_minimal() + 
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  coord_sf(xlim = c(141, 154))

or compare the series within the same pair (as an example here we only look at records in Jan 2020):

We can see that in general the peaks of precipitation recorded by traditional and automated stations from our pairs are matched, while the exact read may need to be further calibrated.

Spatio-temporal matching

Bureau of Meteorology collects water data from river gauges and this includes variables: electrical conductivity, turbidity, water course discharge, water course level, and water temperature. In particular, water level will interactive with precipitation from the climate data since rainfall will raise the water level in the river. Here is the location of available weather station and water gauges in Victoria:

Here we provide more details on how temporal matching works in cubble. Suppose two locations have been matched spatially and temproal matching will be conducted on variable A and a in the plot below: .

We first find the n peaks in each series (3 peaks here). A variable needs to be specified in temporal_independent for construct an interval. Here we pick variable A and construct an interval with a default length of 5. The peaks in variable a are then tested against whether they fall into the any of the intervals constructed from A. In this illustration, there are 2 matches for these two variable The available tuning parameter in temporal matches are:

In the river level and precipitation example, Water_course_level in river will be matched to prcp in climate. This can be specified in temporal_by, an analogue to the by syntax in join. The goal in this example is to see if precipitation will be reflected by the water level in the river and this puts precipitation prcp, as the independent. Given there is one year worth of data, the number of peak (temporal_n_highest) to consider is slightly raised from a default 20 to 30 and temporal_min_match is raised accordingly.

res <- match_sites(river, vic,
                   temporal_by = c("Water_course_level" = "prcp"),
                   temporal_independent = "prcp",
                   temporal_n_highest = 30,
                   temporal_min_match = 15)

The output from temporal matching is also a cubble with n_match for the number of matched temporal peaks (on top of the dist and group from spatial matching):

res
#> # cubble:   id [8]: nested form
#> # bbox:     [144.52, -37.73, 146.06, -36.55]
#> # temporal: date [date], matched_var [dbl]
#>   id          name                  lat  long type   dist group ts       n_match
#>   <chr>       <chr>               <dbl> <dbl> <chr> <dbl> <int> <list>     <int>
#> 1 405234      SEVEN CREEKS @ D/S… -36.9  146. river  6.15     5 <tibble>      21
#> 2 404207      HOLLAND CREEK @ KE… -36.6  146. river  8.54    10 <tibble>      21
#> 3 ASN00082042 strathbogie         -36.8  146. clim…  6.15     5 <tibble>      21
#> 4 ASN00082170 benalla airport     -36.6  146. clim…  8.54    10 <tibble>      21
#> 5 230200      MARIBYRNONG RIVER … -37.7  145. river  6.17     6 <tibble>      19
#> 6 ASN00086038 essendon airport    -37.7  145. clim…  6.17     6 <tibble>      19
#> 7 406213      CAMPASPE RIVER @ R… -37.0  145. river  1.84     1 <tibble>      18
#> 8 ASN00088051 redesdale           -37.0  145. clim…  1.84     1 <tibble>      18

We can look at the matched pair on the map:

ggplot() + 
  geom_sf(data = vic_map) + 
  geom_point(data = res, 
             aes(x = long, y = lat, color = type)) + 
  ggrepel::geom_label_repel(data = res %>%  filter(type == "river"),
                            aes(x = long, y = lat, label = group)) + 
  scale_color_brewer(palette = "Dark2") + 
  ggplot2::theme_bw() +
  ggplot2::theme(legend.position = "bottom") +
  ggplot2::labs(x = "Longitude", y = "Latitude") + 
  ggtitle("Victoria") 

or to look at the series:

res_long <- res %>%  
  face_temporal(ts) %>%  
  unfold(group, type) %>%  
  rename(prcp = matched_var) %>%  
  mutate(prcp = (prcp - min(prcp, na.rm = TRUE))/ (max(prcp, na.rm = TRUE) - min(prcp, na.rm = TRUE))) 

res_long %>%  
  ggplot(aes(x = date, y = prcp, group = type,color = type)) + 
  geom_line() + 
  facet_wrap(vars(group)) + 
  scale_color_brewer(palette = "Dark2", guide = "none") + 
  theme_bw() + 
  labs(x=  "date") + 
  scale_x_date(date_labels = "%b") + 
  labs(x = "Week", y = "Precipitation/ water level")

There are four pairs of matches - all locates in the middle Victoria and we can observe concurrent increase of precipitation and water level (precipitation and water level have been standardised between 0 and 1 to be displayed on the same scale).