This article shows you how to create a cubble from data in the wild. You should have already seen an example of constructing a cubble from a tibble on the README page. Here are more examples that construct a cubble from a tibble with a list column, a tsibble, separate spatial and temporal tables, and NetCDF data.
A list column can be a useful structure when querying data, since you can first create a row-wise data frame with all the metadata and then query the data as a nested list. We show how to convert this output into a cubble with a small example of five weather stations close to Sydney, Australia. Here is the metadata of the stations:
syd
#> # A tibble: 5 × 6
#> id lat long elev name wmo_id
#> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 ASN00066062 -33.9 151. 39 sydney (observatory hill) 94768
#> 2 ASN00066037 -33.9 151. 6 sydney airport amo 94767
#> 3 ASN00066194 -33.9 151. 3 canterbury racecourse aws 94766
#> 4 ASN00066124 -33.8 151. 55 parramatta north (masons drive 94764
#> 5 ASN00066059 -33.7 151. 199 terrey hills aws 94759
Temporal variables can be queried by rnoaa::meteo_pull_monitors() with the date range and variable supplied. Here we turn the stations data into a rowwise data frame and then query the climate variables as a list-column, ts:
raw <- syd %>%
  rowwise() %>%
  mutate(ts = list(rnoaa::meteo_pull_monitors(id,
                                              date_min = "2020-01-01",
                                              date_max = "2020-12-31",
                                              var = c("PRCP", "TMAX", "TMIN")) %>% select(-id)))
raw
#> # A tibble: 5 × 7
#> # Rowwise:
#> id lat long elev name wmo_id ts
#> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <list>
#> 1 ASN00066062 -33.9 151. 39 sydney (observatory hill) 94768 <tibble>
#> 2 ASN00066037 -33.9 151. 6 sydney airport amo 94767 <tibble>
#> 3 ASN00066194 -33.9 151. 3 canterbury racecourse aws 94766 <tibble>
#> 4 ASN00066124 -33.8 151. 55 parramatta north (masons drive 94764 <tibble>
#> 5 ASN00066059 -33.7 151. 199 terrey hills aws 94759 <tibble>
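The rowwise list-column pattern can be tried without network access. The following is a minimal sketch, assuming dplyr and tibble are installed; fake_pull() is a hypothetical stand-in for rnoaa::meteo_pull_monitors() that returns made-up values, and the two-station table is toy data rather than the Sydney stations above.

```r
library(dplyr)
library(tibble)

# Toy station metadata (ids and coordinates are made up)
stations_toy <- tibble(
  id   = c("st1", "st2"),
  lat  = c(-33.9, -33.7),
  long = c(151.2, 151.0)
)

# Hypothetical stand-in for a remote query: returns three days of
# made-up values for whichever station id it is given
fake_pull <- function(id) {
  tibble(
    date = as.Date("2020-01-01") + 0:2,
    prcp = c(0, 5, 1),
    tmax = c(30, 25, 27)
  )
}

# One query per row; each result is stored as one element of ts
raw_toy <- stations_toy %>%
  rowwise() %>%
  mutate(ts = list(fake_pull(id))) %>%
  ungroup()

raw_toy
```

Wrapping the call in list() is what keeps each station's result as a single element of the ts column instead of trying to spread it across rows.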
A cubble can then be created by supplying id as the key to identify each station, date as the index to identify time, and coords to identify the spatial coordinates of each station:
syd_climate <- raw %>%
  as_cubble(key = id, index = date, coords = c(long, lat))
syd_climate
#> # cubble: id [5]: nested form
#> # bbox: [151.01, -33.95, 151.23, -33.69]
#> # temporal: date [date], prcp [dbl], tmax [dbl], tmin [dbl]
#> id lat long elev name wmo_id ts
#> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <list>
#> 1 ASN00066062 -33.9 151. 39 sydney (observatory hill) 94768 <tibble>
#> 2 ASN00066037 -33.9 151. 6 sydney airport amo 94767 <tibble>
#> 3 ASN00066194 -33.9 151. 3 canterbury racecourse aws 94766 <tibble>
#> 4 ASN00066124 -33.8 151. 55 parramatta north (masons drive 94764 <tibble>
#> 5 ASN00066059 -33.7 151. 199 terrey hills aws 94759 <tibble>
If you already have a tsibble object, with key and index registered, only coords needs to be specified to create a cubble:
dt
#> # A tsibble: 1,830 x 10 [1D]
#> # Key: id [5]
#> id lat long elev name wmo_id date prcp tmax tmin
#> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <date> <dbl> <dbl> <dbl>
#> 1 ASN00009021 -31.9 116. 15.4 perth airp… 94610 2020-01-01 0 31.9 15.3
#> 2 ASN00009021 -31.9 116. 15.4 perth airp… 94610 2020-01-02 0 24.9 16.4
#> 3 ASN00009021 -31.9 116. 15.4 perth airp… 94610 2020-01-03 6 23.2 13
#> 4 ASN00009021 -31.9 116. 15.4 perth airp… 94610 2020-01-04 0 28.4 12.4
#> 5 ASN00009021 -31.9 116. 15.4 perth airp… 94610 2020-01-05 0 35.3 11.6
#> 6 ASN00009021 -31.9 116. 15.4 perth airp… 94610 2020-01-06 0 34.8 13.1
#> 7 ASN00009021 -31.9 116. 15.4 perth airp… 94610 2020-01-07 0 32.8 15.1
#> 8 ASN00009021 -31.9 116. 15.4 perth airp… 94610 2020-01-08 0 30.4 17.4
#> 9 ASN00009021 -31.9 116. 15.4 perth airp… 94610 2020-01-09 0 28.7 17.3
#> 10 ASN00009021 -31.9 116. 15.4 perth airp… 94610 2020-01-10 0 32.6 15.8
#> # … with 1,820 more rows
dt %>% as_cubble(coords = c(long, lat))
#> # cubble: id [5]: nested form
#> # bbox: [115.97, -32.94, 133.55, -12.42]
#> # temporal: date [date], prcp [dbl], tmax [dbl], tmin [dbl]
#> id lat long elev name wmo_id ts
#> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <list>
#> 1 ASN00009021 -31.9 116. 15.4 perth airport 94610 <tbl_ts [366 × 4]>
#> 2 ASN00010311 -31.9 117. 179 york 94623 <tbl_ts [366 × 4]>
#> 3 ASN00010614 -32.9 117. 338 narrogin 94627 <tbl_ts [366 × 4]>
#> 4 ASN00014015 -12.4 131. 30.4 darwin airport 94120 <tbl_ts [366 × 4]>
#> 5 ASN00015131 -17.6 134. 220 elliott 94236 <tbl_ts [366 × 4]>
Notice that each element of the list-column ts is of class tbl_ts, so you are free to apply your favourite tsibble functions to it :)
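If your data are still a plain long-form tibble, the key and index have to be registered first. A minimal sketch, assuming the tsibble and tibble packages are installed; long_toy is made-up data with the same kind of columns as dt above:

```r
library(tsibble)
library(tibble)

# Toy long-form data: two stations, three days each (values are made up)
long_toy <- tibble(
  id   = rep(c("st1", "st2"), each = 3),
  lat  = rep(c(-31.9, -12.4), each = 3),
  long = rep(c(116.0, 130.9), each = 3),
  date = rep(as.Date("2020-01-01") + 0:2, times = 2),
  tmax = c(31.9, 24.9, 23.2, 33.0, 32.5, 31.8)
)

# Register id as the key and date as the index
dt_toy <- as_tsibble(long_toy, key = id, index = date)

# Check what has been registered
key_vars(dt_toy)
index_var(dt_toy)
```

Once the key and index are registered this way, the object carries them along, which is why only coords is left to supply.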
Sometimes, you may get the spatial and temporal data from different sources and they live in two separate tables with a linking variable:
# a spatial sheet
cubble::stations
#> # A tibble: 5 × 6
#> id lat long elev name wmo_id
#> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 ASN00009021 -31.9 116. 15.4 perth airport 94610
#> 2 ASN00010311 -31.9 117. 179 york 94623
#> 3 ASN00010614 -32.9 117. 338 narrogin 94627
#> 4 ASN00014015 -12.4 131. 30.4 darwin airport 94120
#> 5 ASN00015131 -17.6 134. 220 elliott 94236
# a temporal sheet
cubble::climate
#> # A tibble: 1,830 × 5
#> id date prcp tmax tmin
#> <chr> <date> <dbl> <dbl> <dbl>
#> 1 ASN00009021 2020-01-01 0 31.9 15.3
#> 2 ASN00009021 2020-01-02 0 24.9 16.4
#> 3 ASN00009021 2020-01-03 6 23.2 13
#> 4 ASN00009021 2020-01-04 0 28.4 12.4
#> 5 ASN00009021 2020-01-05 0 35.3 11.6
#> 6 ASN00009021 2020-01-06 0 34.8 13.1
#> 7 ASN00009021 2020-01-07 0 32.8 15.1
#> 8 ASN00009021 2020-01-08 0 30.4 17.4
#> 9 ASN00009021 2020-01-09 0 28.7 17.3
#> 10 ASN00009021 2020-01-10 0 32.6 15.8
#> # … with 1,820 more rows
When a cubble is created from separate tables, as_cubble() checks that the linking variable matches between the spatial and temporal tables.
as_cubble(list(spatial = cubble::stations, temporal = cubble::climate),
key = id, index = date, coords = c(long, lat))
#> # cubble: id [5]: nested form
#> # bbox: [115.97, -32.94, 133.55, -12.42]
#> # temporal: date [date], prcp [dbl], tmax [dbl], tmin [dbl]
#> id lat long elev name wmo_id ts
#> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <list>
#> 1 ASN00009021 -31.9 116. 15.4 perth airport 94610 <tibble [366 × 4]>
#> 2 ASN00010311 -31.9 117. 179 york 94623 <tibble [366 × 4]>
#> 3 ASN00010614 -32.9 117. 338 narrogin 94627 <tibble [366 × 4]>
#> 4 ASN00014015 -12.4 131. 30.4 darwin airport 94120 <tibble [366 × 4]>
#> 5 ASN00015131 -17.6 134. 220 elliott 94236 <tibble [366 × 4]>
If unmatched values are detected, a message is emitted to report them. This is useful for catching sites that have only spatial or only temporal information, as well as slight mismatches in a character linking variable.
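You can also inspect the match yourself before building the cubble. The sketch below uses dplyr::anti_join() on toy tables; it illustrates the kind of mismatch described above and is not how as_cubble() performs its check internally. All ids and values are made up, and "ST3" differs from "st3" only in case.

```r
library(dplyr)
library(tibble)

# Toy spatial table: three sites
spatial_toy <- tibble(
  id   = c("st1", "st2", "st3"),
  lat  = c(-31.9, -31.9, -32.9),
  long = c(116.0, 116.8, 117.2)
)

# Toy temporal table: note the id "ST3" in upper case
temporal_toy <- tibble(
  id   = c("st1", "st1", "st2", "ST3"),
  date = as.Date("2020-01-01") + c(0, 1, 0, 0),
  tmax = c(31.9, 24.9, 30.1, 28.4)
)

# Sites with spatial information but no temporal records
only_spatial <- anti_join(spatial_toy, temporal_toy, by = "id")

# Temporal records whose id has no spatial match
only_temporal <- anti_join(temporal_toy, spatial_toy, by = "id")

only_spatial$id
only_temporal$id
```

Seeing both "st3" and "ST3" left over points to a spelling difference in the linking variable rather than genuinely missing data.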
NetCDF (Network Common Data Form) is a commonly used data format in the climatology community to deliver global mapping of atmosphere, ocean, and land. Here we provide some general information regarding this format before showing an example of creating a cubble out of NetCDF data.
NetCDF data has two main components: dimensions and variables.
Attributes are usually associated with the dimensions and variables in NetCDF data. A few R packages exist for manipulating NetCDF data, including a high-level R interface, ncdf4; a low-level interface that calls the C interface, RNetCDF; and a tidyverse implementation, tidync. Let’s take a look at a NetCDF file:
path <- system.file("ncdf/era5-pressure.nc", package = "cubble")
raw <- ncdf4::nc_open(path)
raw
#> File /private/var/folders/_t/v9kjp3yn2k73wm16y_jlphgsbbd6_6/T/Rtmpda2m05/Rinst27fc69f8d93d/cubble/ncdf/era5-pressure.nc (NC_FORMAT_64BIT):
#>
#> 2 variables (excluding dimension variables):
#> short q[longitude,latitude,time]
#> scale_factor: 2.09848696659051e-11
#> add_offset: 3.16766314740189e-06
#> _FillValue: -32767
#> missing_value: -32767
#> units: kg kg**-1
#> long_name: Specific humidity
#> standard_name: specific_humidity
#> short z[longitude,latitude,time]
#> scale_factor: 0.154814177589917
#> add_offset: 306067.078842911
#> _FillValue: -32767
#> missing_value: -32767
#> units: m**2 s**-2
#> long_name: Geopotential
#> standard_name: geopotential
#>
#> 3 dimensions:
#> longitude Size:161
#> units: degrees_east
#> long_name: longitude
#> latitude Size:165
#> units: degrees_north
#> long_name: latitude
#> time Size:6
#> units: hours since 1900-01-01 00:00:00.0
#> long_name: time
#> calendar: gregorian
#>
#> 2 global attributes:
#> Conventions: CF-1.6
#> history: 2022-04-17 01:14:41 GMT by grib_to_netcdf-2.24.3: /opt/ecmwf/mars-client/bin/grib_to_netcdf -S param -o /cache/data8/adaptor.mars.internal-1650158081.486927-5264-6-90c05c05-ca42-4b7a-b6f8-d9617d0f0500.nc /cache/tmp/90c05c05-ca42-4b7a-b6f8-d9617d0f0500-adaptor.mars.internal-1650158080.1120768-5264-12-tmp.grib
With NetCDF data, it is not the actual data that gets printed, but the metadata. There are 2 variables and 3 dimensions in this data, and each is associated with a few attributes. Here, the attributes of the two variables include the scale factor and offset, the missing value and its fill value, along with the units and names. NetCDF data are often stored as packed values to save space, and you will sometimes need a formula like \(\text{unpacked value} = \text{packed value} \times \text{scale factor} + \text{add offset}\) to unpack the data. Luckily, when you read NetCDF data with the ncdf4 package, the data are already unpacked for you, so there is no need to worry about the scaling and offset.
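To make the unpacking formula concrete, here is a small worked sketch in base R. The scale factor, offset, and fill value are copied from the printed attributes of q; the packed integers are made up for illustration, and unpack() is a hypothetical helper, not a function from ncdf4 or cubble.

```r
# Attribute values copied from the printed metadata for q
scale_factor <- 2.09848696659051e-11
add_offset   <- 3.16766314740189e-06
fill_value   <- -32767

# Made-up packed values (16-bit integers), including one fill value
packed <- c(-32767L, 0L, 1000L)

# Hypothetical helper: apply the formula and mark fill values as missing
unpack <- function(packed, scale_factor, add_offset, fill_value) {
  out <- packed * scale_factor + add_offset
  out[packed == fill_value] <- NA_real_
  out
}

unpacked <- unpack(packed, scale_factor, add_offset, fill_value)
unpacked
```

A packed value of 0 unpacks to the offset itself, and each step of 1 in the packed value moves the result by one scale factor.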
In principle, NetCDF files can store data with arbitrary variables, dimensions, and attributes, which makes it hard to write general code for manipulating them. The Climate and Forecast (CF) metadata convention is a guideline designed to standardise the format of NetCDF data. Thanks to the CF convention, cubble can extract the specific components it defines to build a cubble from NetCDF data.
Cubble provides an as_cubble() method to coerce the ncdf4 class from the ncdf4 package into a cubble. It maps each combination of longitude and latitude into an id as the key:
dt <- as_cubble(raw, vars = c("q", "z"))
dt
#> # cubble: id [26565]: nested form
#> # bbox: [113, -53, 153, -12]
#> # temporal: time [dttm], q [dbl], z [dbl]
#> id long lat ts
#> <int> <dbl> <dbl> <list>
#> 1 1 113 -12 <tibble [6 × 3]>
#> 2 2 113. -12 <tibble [6 × 3]>
#> 3 3 114. -12 <tibble [6 × 3]>
#> 4 4 114. -12 <tibble [6 × 3]>
#> 5 5 114 -12 <tibble [6 × 3]>
#> 6 6 114. -12 <tibble [6 × 3]>
#> 7 7 114. -12 <tibble [6 × 3]>
#> 8 8 115. -12 <tibble [6 × 3]>
#> 9 9 115 -12 <tibble [6 × 3]>
#> 10 10 115. -12 <tibble [6 × 3]>
#> # … with 26,555 more rows
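The number of ids follows from the grid: 161 longitudes times 165 latitudes gives 26,565 combinations. The base R sketch below rebuilds such a grid to show the mapping; the 0.25 degree spacing and the north-to-south latitude order are inferred from the printed bounding box, dimension sizes, and first rows, and this is an illustration rather than cubble's own implementation.

```r
# Grid implied by the printed bounding box and dimension sizes
# (the 0.25 degree spacing is an inference from those numbers)
long <- seq(113, 153, by = 0.25)   # 161 values
lat  <- seq(-12, -53, by = -0.25)  # 165 values, north to south

# Every longitude/latitude combination; longitude varies fastest
grid <- expand.grid(long = long, lat = lat)

# One id per combination
grid$id <- seq_len(nrow(grid))

nrow(grid)
head(grid[, c("id", "long", "lat")])
```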
The memory limit with NetCDF data in cubble depends on the number of longitude grid points x latitude grid points x time points x variables. Cubble can handle slightly more than 300 x 300 (longitude x latitude) grid points for 3 variables in one year. You can reduce the number of spatial grid points in exchange for a longer time period or more variables. A 300 by 300 spatial grid can be:
Subsetting the longitude and latitude grid is available through long_range and lat_range if the NetCDF file has a finer resolution than needed.
dt <- as_cubble(raw, vars = c("q", "z"),
                long_range = seq(-180, 180, 1),
                lat_range = seq(-90, -5, 1))
dt
#> # cubble: id [6765]: nested form
#> # bbox: [113, -53, 153, -12]
#> # temporal: time [dttm], q [dbl], z [dbl]
#> id long lat ts
#> <int> <dbl> <dbl> <list>
#> 1 1 113 -12 <tibble [6 × 3]>
#> 2 2 114 -12 <tibble [6 × 3]>
#> 3 3 115 -12 <tibble [6 × 3]>
#> 4 4 116 -12 <tibble [6 × 3]>
#> 5 5 117 -12 <tibble [6 × 3]>
#> 6 6 118 -12 <tibble [6 × 3]>
#> 7 7 119 -12 <tibble [6 × 3]>
#> 8 8 120 -12 <tibble [6 × 3]>
#> 9 9 121 -12 <tibble [6 × 3]>
#> 10 10 122 -12 <tibble [6 × 3]>
#> # … with 6,755 more rows
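The effect of coarsening a grid can be sketched on a small toy example in base R. This assumes that a grid point is kept only when its coordinate exactly equals one of the supplied values; how as_cubble() matches long_range and lat_range against the file's grid is not described here, so treat the matching rule as an assumption.

```r
# Toy 0.25 degree grid (not the ERA5 file used above)
long <- seq(113, 115, by = 0.25)  # 9 values
lat  <- seq(-13, -12, by = 0.25)  # 5 values

# Keep only whole-degree grid points (exact matching is an assumption)
keep_long <- long[long %in% seq(-180, 180, 1)]
keep_lat  <- lat[lat %in% seq(-90, -5, 1)]

fine   <- expand.grid(long = long, lat = lat)
coarse <- expand.grid(long = keep_long, lat = keep_lat)

nrow(fine)
nrow(coarse)
```

Going from 0.25 degree to 1 degree spacing keeps roughly one point in four along each axis, so the number of spatial ids drops quickly.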