0. Import data as a cubble

This article shows you how to create a cubble from data in the wild. You should already have seen an example of constructing a cubble from a tibble on the README page. Here are more examples, which construct a cubble from a tibble with a list column, a tsibble, separate spatial and temporal tables, and NetCDF data.

A cubble from tibble with list column

A list column can be a useful structure when querying data, since you can first create a row-wise data frame with all the metadata and then query the data as a nested list. We will show how to convert this output into a cubble using a small example with five weather stations close to Sydney, Australia. Here is the metadata of the stations:

syd
#> # A tibble: 5 × 6
#>   id            lat  long  elev name                           wmo_id
#>   <chr>       <dbl> <dbl> <dbl> <chr>                           <dbl>
#> 1 ASN00066062 -33.9  151.    39 sydney (observatory hill)       94768
#> 2 ASN00066037 -33.9  151.     6 sydney airport amo              94767
#> 3 ASN00066194 -33.9  151.     3 canterbury racecourse aws       94766
#> 4 ASN00066124 -33.8  151.    55 parramatta north (masons drive  94764
#> 5 ASN00066059 -33.7  151.   199 terrey hills aws                94759

Temporal variables can be queried with rnoaa::meteo_pull_monitors() by supplying the date range and the variables. Here we turn the station data into a rowwise data frame and then query the climate variables as a list-column, ts:

# query each station's 2020 daily series and store it as a list-column
raw <- syd %>%
  rowwise() %>%
  mutate(ts = list(
    rnoaa::meteo_pull_monitors(
      id,
      date_min = "2020-01-01",
      date_max = "2020-12-31",
      var = c("PRCP", "TMAX", "TMIN")
    ) %>%
      select(-id)
  ))
raw
#> # A tibble: 5 × 7
#> # Rowwise: 
#>   id            lat  long  elev name                           wmo_id ts      
#>   <chr>       <dbl> <dbl> <dbl> <chr>                           <dbl> <list>  
#> 1 ASN00066062 -33.9  151.    39 sydney (observatory hill)       94768 <tibble>
#> 2 ASN00066037 -33.9  151.     6 sydney airport amo              94767 <tibble>
#> 3 ASN00066194 -33.9  151.     3 canterbury racecourse aws       94766 <tibble>
#> 4 ASN00066124 -33.8  151.    55 parramatta north (masons drive  94764 <tibble>
#> 5 ASN00066059 -33.7  151.   199 terrey hills aws                94759 <tibble>

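A cubble can then be created by supplying the key (the station identifier), the index (the time variable), and the coords (longitude and latitude):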

syd_climate <- raw %>%  
  as_cubble(key = id, index = date, coords = c(long, lat))

syd_climate
#> # cubble:   id [5]: nested form
#> # bbox:     [151.01, -33.95, 151.23, -33.69]
#> # temporal: date [date], prcp [dbl], tmax [dbl], tmin [dbl]
#>   id            lat  long  elev name                           wmo_id ts      
#>   <chr>       <dbl> <dbl> <dbl> <chr>                           <dbl> <list>  
#> 1 ASN00066062 -33.9  151.    39 sydney (observatory hill)       94768 <tibble>
#> 2 ASN00066037 -33.9  151.     6 sydney airport amo              94767 <tibble>
#> 3 ASN00066194 -33.9  151.     3 canterbury racecourse aws       94766 <tibble>
#> 4 ASN00066124 -33.8  151.    55 parramatta north (masons drive  94764 <tibble>
#> 5 ASN00066059 -33.7  151.   199 terrey hills aws                94759 <tibble>
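
The list-column pattern used above can be tried offline with base R alone. The following is a minimal sketch rather than the rnoaa workflow: fake_query() is a hypothetical stand-in for the network call, and the two stations and their values are made up for illustration.

```r
# Station metadata: one row per station (toy values, not real data)
stations <- data.frame(
  id   = c("st_a", "st_b"),
  lat  = c(-33.9, -33.7),
  long = c(151.2, 151.2)
)

# Hypothetical stand-in for a query function: returns a small
# data frame of daily values for the requested station
fake_query <- function(id) {
  data.frame(
    date = as.Date("2020-01-01") + 0:2,
    prcp = c(0, 1.2, 0)
  )
}

# One data frame per station, stored as a list column
stations$ts <- lapply(stations$id, fake_query)

# Each element of the list column is itself a data frame
stations$ts[[1]]
```

The metadata stays at one row per station, while the time series of each station sits in its own element of ts. This is the shape that as_cubble() receives in the example above.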

A cubble from tsibble

If you already have a tsibble object with key and index registered, only coords needs to be specified to create a cubble:

dt
#> # A tsibble: 1,830 x 10 [1D]
#> # Key:       id [5]
#>    id            lat  long  elev name        wmo_id date        prcp  tmax  tmin
#>    <chr>       <dbl> <dbl> <dbl> <chr>        <dbl> <date>     <dbl> <dbl> <dbl>
#>  1 ASN00009021 -31.9  116.  15.4 perth airp…  94610 2020-01-01     0  31.9  15.3
#>  2 ASN00009021 -31.9  116.  15.4 perth airp…  94610 2020-01-02     0  24.9  16.4
#>  3 ASN00009021 -31.9  116.  15.4 perth airp…  94610 2020-01-03     6  23.2  13  
#>  4 ASN00009021 -31.9  116.  15.4 perth airp…  94610 2020-01-04     0  28.4  12.4
#>  5 ASN00009021 -31.9  116.  15.4 perth airp…  94610 2020-01-05     0  35.3  11.6
#>  6 ASN00009021 -31.9  116.  15.4 perth airp…  94610 2020-01-06     0  34.8  13.1
#>  7 ASN00009021 -31.9  116.  15.4 perth airp…  94610 2020-01-07     0  32.8  15.1
#>  8 ASN00009021 -31.9  116.  15.4 perth airp…  94610 2020-01-08     0  30.4  17.4
#>  9 ASN00009021 -31.9  116.  15.4 perth airp…  94610 2020-01-09     0  28.7  17.3
#> 10 ASN00009021 -31.9  116.  15.4 perth airp…  94610 2020-01-10     0  32.6  15.8
#> # … with 1,820 more rows
dt %>%  as_cubble(coords = c(long, lat))
#> # cubble:   id [5]: nested form
#> # bbox:     [115.97, -32.94, 133.55, -12.42]
#> # temporal: date [date], prcp [dbl], tmax [dbl], tmin [dbl]
#>   id            lat  long  elev name           wmo_id ts                
#>   <chr>       <dbl> <dbl> <dbl> <chr>           <dbl> <list>            
#> 1 ASN00009021 -31.9  116.  15.4 perth airport   94610 <tbl_ts [366 × 4]>
#> 2 ASN00010311 -31.9  117. 179   york            94623 <tbl_ts [366 × 4]>
#> 3 ASN00010614 -32.9  117. 338   narrogin        94627 <tbl_ts [366 × 4]>
#> 4 ASN00014015 -12.4  131.  30.4 darwin airport  94120 <tbl_ts [366 × 4]>
#> 5 ASN00015131 -17.6  134. 220   elliott         94236 <tbl_ts [366 × 4]>

Notice that each element of the list-column ts is of class tbl_ts, so you are free to apply your favourite functions for the tsibble class to it :)

Separate spatial and temporal tables

Sometimes, you may get the spatial and temporal data from different sources and they live in two separate tables with a linking variable:

# a spatial sheet
cubble::stations
#> # A tibble: 5 × 6
#>   id            lat  long  elev name           wmo_id
#>   <chr>       <dbl> <dbl> <dbl> <chr>           <dbl>
#> 1 ASN00009021 -31.9  116.  15.4 perth airport   94610
#> 2 ASN00010311 -31.9  117. 179   york            94623
#> 3 ASN00010614 -32.9  117. 338   narrogin        94627
#> 4 ASN00014015 -12.4  131.  30.4 darwin airport  94120
#> 5 ASN00015131 -17.6  134. 220   elliott         94236

# a temporal sheet
cubble::climate
#> # A tibble: 1,830 × 5
#>    id          date        prcp  tmax  tmin
#>    <chr>       <date>     <dbl> <dbl> <dbl>
#>  1 ASN00009021 2020-01-01     0  31.9  15.3
#>  2 ASN00009021 2020-01-02     0  24.9  16.4
#>  3 ASN00009021 2020-01-03     6  23.2  13  
#>  4 ASN00009021 2020-01-04     0  28.4  12.4
#>  5 ASN00009021 2020-01-05     0  35.3  11.6
#>  6 ASN00009021 2020-01-06     0  34.8  13.1
#>  7 ASN00009021 2020-01-07     0  32.8  15.1
#>  8 ASN00009021 2020-01-08     0  30.4  17.4
#>  9 ASN00009021 2020-01-09     0  28.7  17.3
#> 10 ASN00009021 2020-01-10     0  32.6  15.8
#> # … with 1,820 more rows

When a cubble is created from separate tables, as_cubble() checks that the linking variable matches between the spatial and temporal tables.

as_cubble(list(spatial = cubble::stations, temporal = cubble::climate),
          key = id, index = date, coords = c(long, lat))
#> # cubble:   id [5]: nested form
#> # bbox:     [115.97, -32.94, 133.55, -12.42]
#> # temporal: date [date], prcp [dbl], tmax [dbl], tmin [dbl]
#>   id            lat  long  elev name           wmo_id ts                
#>   <chr>       <dbl> <dbl> <dbl> <chr>           <dbl> <list>            
#> 1 ASN00009021 -31.9  116.  15.4 perth airport   94610 <tibble [366 × 4]>
#> 2 ASN00010311 -31.9  117. 179   york            94623 <tibble [366 × 4]>
#> 3 ASN00010614 -32.9  117. 338   narrogin        94627 <tibble [366 × 4]>
#> 4 ASN00014015 -12.4  131.  30.4 darwin airport  94120 <tibble [366 × 4]>
#> 5 ASN00015131 -17.6  134. 220   elliott         94236 <tibble [366 × 4]>

A message is emitted to report any unmatched entries that are detected. This is useful for catching sites that have only spatial or only temporal information, as well as slight mismatches in a character linking variable.
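
To see what kind of mismatch is meant, the linking variable can also be compared by hand before building the cubble. This is a minimal base R sketch with made-up tables, not the check that as_cubble() runs internally: it uses setdiff() to list the ids that appear in only one of the two tables.

```r
# Toy tables (made-up ids and values), each with a linking variable `id`
spatial <- data.frame(
  id   = c("st_1", "st_2", "st_3"),
  lat  = c(-31.9, -32.9, -12.4),
  long = c(116.0, 117.2, 130.9)
)
temporal <- data.frame(
  id   = c("st_1", "st_2", "ST_3"),  # note the upper-case id
  date = as.Date("2020-01-01"),
  tmax = c(31.9, 30.1, 33.4)
)

# Ids present in one table but not in the other
only_spatial  <- setdiff(spatial$id, temporal$id)
only_temporal <- setdiff(temporal$id, spatial$id)

only_spatial
only_temporal
```

Here the same site is reported on both sides because its id differs only in letter case, which is the kind of slight mismatch in a character linking variable that is easy to overlook.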

A cubble from NetCDF data

NetCDF (Network Common Data Form) is a commonly used data format in the climatology community to deliver global mapping of atmosphere, ocean, and land. Here we provide some general information regarding this format before showing an example of creating a cubble out of NetCDF data.

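NetCDF data has two main components: dimensions, which define the grid (longitude, latitude, and time in the example below), and variables, which hold the values on that grid.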

Attributes are usually associated with the dimensions and variables in NetCDF data. A few R packages exist for manipulating NetCDF data, including a high-level R interface, ncdf4; a low-level interface that calls the C interface, RNetCDF; and a tidyverse implementation, tidync. Here let’s take a look at a NetCDF file:

path <- system.file("ncdf/era5-pressure.nc", package = "cubble")
raw <- ncdf4::nc_open(path)
raw
#> File /private/var/folders/_t/v9kjp3yn2k73wm16y_jlphgsbbd6_6/T/Rtmpda2m05/Rinst27fc69f8d93d/cubble/ncdf/era5-pressure.nc (NC_FORMAT_64BIT):
#> 
#>      2 variables (excluding dimension variables):
#>         short q[longitude,latitude,time]   
#>             scale_factor: 2.09848696659051e-11
#>             add_offset: 3.16766314740189e-06
#>             _FillValue: -32767
#>             missing_value: -32767
#>             units: kg kg**-1
#>             long_name: Specific humidity
#>             standard_name: specific_humidity
#>         short z[longitude,latitude,time]   
#>             scale_factor: 0.154814177589917
#>             add_offset: 306067.078842911
#>             _FillValue: -32767
#>             missing_value: -32767
#>             units: m**2 s**-2
#>             long_name: Geopotential
#>             standard_name: geopotential
#> 
#>      3 dimensions:
#>         longitude  Size:161 
#>             units: degrees_east
#>             long_name: longitude
#>         latitude  Size:165 
#>             units: degrees_north
#>             long_name: latitude
#>         time  Size:6 
#>             units: hours since 1900-01-01 00:00:00.0
#>             long_name: time
#>             calendar: gregorian
#> 
#>     2 global attributes:
#>         Conventions: CF-1.6
#>         history: 2022-04-17 01:14:41 GMT by grib_to_netcdf-2.24.3: /opt/ecmwf/mars-client/bin/grib_to_netcdf -S param -o /cache/data8/adaptor.mars.internal-1650158081.486927-5264-6-90c05c05-ca42-4b7a-b6f8-d9617d0f0500.nc /cache/tmp/90c05c05-ca42-4b7a-b6f8-d9617d0f0500-adaptor.mars.internal-1650158080.1120768-5264-12-tmp.grib

When a NetCDF file is printed, it is not the actual data that is shown, but the metadata. There are 2 variables and 3 dimensions in this file, and each is associated with a few attributes. Here the attributes of the two variables include the scaling and offset parameters, the missing value and its fill value, and the units and names. NetCDF stores data as packed values to save space, and sometimes you will need a formula like \(\text{unpacked value} = \text{packed value} \times \text{scale factor} + \text{add offset}\) to unpack the data. Luckily, the ncdf4 package already unpacks the data for you when reading it in, so there is no need to worry about the scaling and offset.
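
For illustration only, the unpacking formula can be written out in base R. ncdf4 applies it for you, so this sketch is not needed in practice. The function unpack() and the packed values are made up, while the scale factor, offset, and fill value are those printed for q above.

```r
# Unpack stored integer values using the packing attributes
unpack <- function(packed, scale_factor, add_offset, fill_value = NULL) {
  # Values equal to the fill value mark missing data
  if (!is.null(fill_value)) {
    packed[packed %in% fill_value] <- NA
  }
  packed * scale_factor + add_offset
}

# Attributes of `q` copied from the printout above
q_scale  <- 2.09848696659051e-11
q_offset <- 3.16766314740189e-06

# Made-up packed values; -32767 is the _FillValue
packed <- c(0L, 1000L, -32767L)
unpack(packed, q_scale, q_offset, fill_value = -32767)
```

A packed value of 0 unpacks to the offset itself, and the fill value becomes NA instead of being scaled.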

In principle, NetCDF can store data with arbitrary variables, dimensions, and attributes, which makes it chaotic to generalise its manipulation. The Climate and Forecast (CF) metadata convention is a guideline designed to standardise the format of NetCDF data. Thanks to the CF convention, cubble can extract specific components as per the convention to build a cubble from NetCDF data.

Cubble provides an as_cubble() method to coerce the ncdf4 class from the ncdf4 package into a cubble. It maps each combination of longitude and latitude into an id as the key:

dt <- as_cubble(raw, vars = c("q", "z"))
dt
#> # cubble:   id [26565]: nested form
#> # bbox:     [113, -53, 153, -12]
#> # temporal: time [dttm], q [dbl], z [dbl]
#>       id  long   lat ts              
#>    <int> <dbl> <dbl> <list>          
#>  1     1  113    -12 <tibble [6 × 3]>
#>  2     2  113.   -12 <tibble [6 × 3]>
#>  3     3  114.   -12 <tibble [6 × 3]>
#>  4     4  114.   -12 <tibble [6 × 3]>
#>  5     5  114    -12 <tibble [6 × 3]>
#>  6     6  114.   -12 <tibble [6 × 3]>
#>  7     7  114.   -12 <tibble [6 × 3]>
#>  8     8  115.   -12 <tibble [6 × 3]>
#>  9     9  115    -12 <tibble [6 × 3]>
#> 10    10  115.   -12 <tibble [6 × 3]>
#> # … with 26,555 more rows
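
The mapping from longitude and latitude combinations to ids can be pictured with a small base R sketch. This is not cubble's implementation. It assumes a regular 0.25-degree grid over the bounding box printed above, which is consistent with the 161 longitude and 165 latitude points reported for the file.

```r
# Assumption: a regular 0.25-degree grid over the printed bounding box
long <- seq(113, 153, by = 0.25)   # 161 longitude points
lat  <- seq(-12, -53, by = -0.25)  # 165 latitude points

# One row per (long, lat) combination; expand.grid() varies its first
# argument fastest, so longitude changes first, as in the printout
grid <- expand.grid(long = long, lat = lat)
grid$id <- seq_len(nrow(grid))
grid <- grid[, c("id", "long", "lat")]

nrow(grid)
head(grid, 3)
```

The number of rows is the product of the two dimension sizes, 161 x 165 = 26565, which matches the number of ids in the cubble header.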

The memory limit with NetCDF data in cubble depends on the number of longitude grid points x latitude grid points x time points x variables. Cubble can handle slightly more than 300 x 300 (longitude x latitude) grid points for 3 variables in one year. You can reduce the number of spatial grid points in exchange for a longer time period and more variables. A 300 by 300 spatial grid can be:

Subsetting the longitude and latitude grid is available through long_range and lat_range if the NetCDF file has a finer resolution than needed.

dt <- as_cubble(raw, vars = c("q", "z"),
                long_range = seq(-180, 180, 1),
                lat_range = seq(-90, -5, 1))
dt
#> # cubble:   id [6765]: nested form
#> # bbox:     [113, -53, 153, -12]
#> # temporal: time [dttm], q [dbl], z [dbl]
#>       id  long   lat ts              
#>    <int> <dbl> <dbl> <list>          
#>  1     1   113   -12 <tibble [6 × 3]>
#>  2     2   114   -12 <tibble [6 × 3]>
#>  3     3   115   -12 <tibble [6 × 3]>
#>  4     4   116   -12 <tibble [6 × 3]>
#>  5     5   117   -12 <tibble [6 × 3]>
#>  6     6   118   -12 <tibble [6 × 3]>
#>  7     7   119   -12 <tibble [6 × 3]>
#>  8     8   120   -12 <tibble [6 × 3]>
#>  9     9   121   -12 <tibble [6 × 3]>
#> 10    10   122   -12 <tibble [6 × 3]>
#> # … with 6,755 more rows
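
The size consideration discussed above can be estimated before reading a file. The sketch below is a rough back-of-the-envelope calculation in base R, not a cubble function. The helpers n_values() and approx_mb() are hypothetical, 8 bytes per value is an assumption (R doubles), and the longitude grid is assumed to run from 113 to 153 in steps of 0.25. Only longitude is thinned, to keep the sketch short.

```r
# Number of values = longitude points x latitude points x time points x variables
n_values <- function(n_long, n_lat, n_time, n_var) {
  n_long * n_lat * n_time * n_var
}

# Rough size in megabytes, assuming 8 bytes per value (R doubles)
approx_mb <- function(n) n * 8 / 1024^2

# The example file: 161 x 165 grid, 6 time points, 2 variables
full <- n_values(161, 165, 6, 2)

# Keep only whole-degree longitudes of the assumed 0.25-degree grid
long_grid <- seq(113, 153, by = 0.25)
long_kept <- long_grid[long_grid %in% seq(-180, 180, 1)]
thinned <- n_values(length(long_kept), 165, 6, 2)

c(full = full, thinned = thinned)
approx_mb(c(full = full, thinned = thinned))
```

Keeping every fourth longitude point cuts the number of values to roughly a quarter, and thinning latitude in the same way would reduce it further.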