The study and conservation of the natural world relies on detailed information about the distributions, abundances, environmental associations, and population trends of species over time. For many taxa, this information is challenging to obtain at relevant geographic scales. The goal of the eBird Status and Trends project is to use data from eBird, the global community science bird monitoring program administered by The Cornell Lab of Ornithology, to generate a reliable, standardized source of biodiversity information for the world’s bird populations. To translate the eBird observations into robust data products, we use machine learning to fill spatiotemporal gaps, using local land cover descriptions derived from NASA MODIS and other remote sensing data, while controlling for biases inherent in species observations collected by community scientists.
This data set provides estimates of the full annual cycle
distributions, abundances, and environmental associations for
r
scales::comma(nrow(ebirdst::ebirdst_runs))` species for
the year 2020. For each species, distribution and abundance estimates
are available for all 52 weeks of the year across a regular grid of
locations that cover the globe at a resolution of 2.96 km X 2.96 km.
Variation in detectability associated with the search effort is
controlled by standardizing the estimates as the expected occurrence
rate and count of the species on a 1 hour, 1 km checklist by an expert
eBird observer at the optimal time of day for detecting the species.
To describe how each species is associated with features of its local environment, estimates of the relative importance of each remotely sensed variable (e.g. land cover, elevation, etc), are available throughout the year at a monthly temporal and regional spatial resolution. Additionally, to assess estimate quality, we provide upper and lower confidence bounds for abundance estimates and regional-seasonal scale validation metrics for the underlying statistical models. For more information about the data products see the FAQ and summaries. See Fink et al. (2019) for more information about the analysis used to generate these data.
Data access is granted through an Access Request Form at: https://ebird.org/st/request. Access with this form generates a key to be used with this R package and is provided immediately (as long as commercial use is not requested). Our terms of use have been designed to be quite permissive in many cases, particularly academic and research use. When requesting data access, please be sure to carefully read the terms of use and ensure that your intended use is not restricted.
After completing the Access Request Form, you will be provided a
Status and Trends access key, which you will need when downloading data.
To store the key so the package can access it when downloading data, use
the function set_ebirdst_access_key("XXXXX")
, where
"XXXXX"
is the access key provided to you. Restart
R after setting the access key.
Throughout the package vignettes, a simplified example dataset is used consisting of Yellow-bellied Sapsucker in Michigan. This dataset is designed to be small for faster download and is accessible without a key. The following will download the example dataset to you computer:
library(ebirdst)
library(raster)
library(sf)
library(dplyr)
# download a simplified example dataset for Yellow-bellied Sapsucker in Michigan
<- ebirdst_download(species = "example_data") path
By default, ebirdst_download()
downloads data to a
sensible persistent directory on your computer. You can see what that
directory is with the function ebirdst_data_dir()
. You can
change the default download location to a directory of your choosing by
setting the environment variable EBIRDST_DATA_DIR
, for
example by calling usethis::edit_r_environ()
and adding a
line such as
EBIRDST_DATA_DIR=/custom/download/directory/
.
To download the data package for a species, provide the species code,
English common name, or scientific name to
ebirdst_download()
. Note that data for all other species
requires an API key to access. IMPORTANT: after downloading a
data package, do not change the file structure or rename files. Doing so
will prevent you from being able to work with the data in
R.
The full list of the species with data available for download is
available in the data frame ebirdst_runs
.
glimpse(ebirdst_runs)
which is included in this package. If you’re working in RStudio, you
can use View()
to interactively explore this data frame.
You can also consult the Status and Trends
landing page to see the full list of species.
All species go through a process of expert human review prior to
being released. The ebirdst_runs
data frame also contains
information from this review process. Reviewers assess each of the four
seasons: breeding, non-breeding, pre-breeding migration, and
post-breeding migration. Resident (i.e., non-migratory) species are
identified by having TRUE
in the resident
column of ebirdst_runs
, and these species are assessed
across the whole year rather than seasonally. ebirdst_runs
contains two important pieces of information for each season: a
quality rating and seasonal dates.
The seasonal dates define the weeks that fall within each season. Breeding and non-breeding season dates are defined for each species as the weeks during those seasons when the species’ population does not move. For this reason, these seasons are also described as stationary periods. Migration periods are defined as the periods of movement between the stationary non-breeding and breeding seasons. Note that for many species these migratory periods include not only movement from breeding grounds to non-breeding grounds, but also post-breeding dispersal, molt migration, and other movements.
Reviewers also examine the model estimates for each season to assess the amount of extrapolation or omission present in the model, and assign an associated quality rating ranging from 0 (lowest quality) to 3 (highest quality). Extrapolation refers to cases where the model predicts occurrence where the species is known to be absent, while omission refers to the model failing to predict occurrence where a species is known to be present.
A rating of 0 implies this season failed review and model results should not be used at all for this period. Ratings of 1-3 correspond to a gradient of more to less extrapolation and/or omission, and we often use a traffic light analogy when referring to them:
eBird Status and Trends data packages are identified by the 6-letter
eBird species code (e.g. yepsap
for Yellow-bellied
Sapsucker) and the Status and Trends estimation year (2020 for this
version of the R package). They are stored within sub-directories that
correspond to these variables, and you can get the path to this
directory with:
# for non-example data use the species code or name instead of "example_data"
<- get_species_path("example_data") path
Within this data package directory, the following files and directories will be present:
weekly/
: a directory containing weekly estimates of
occurrence, count, relative abundance, and percent of population on a
regular grid in GeoTIFF format at three resolutions. See below for more
details.seasonal/
: a directory containing seasonal estimates of
occurrence, count, relative abundance, and percent of population on a
regular grid in GeoTIFF format at three resolutions. These are derived
from the corresponding weekly raster data. Dates defining the boundary
of each season are set on a species-specific basis by an expert reviewer
familiar with the species. These dates are available in the
ebirdst_runs
data frame. Only seasons that passed the
expert review process are included. See below for more details.ranges/
: a directory containing GeoPackages storing
range boundary polygons. See below for more details.stixel_summary.db
: an SQLite database containing
information on habitat associations, including predictor importance (PI)
and partial dependence (PD) estimates.predictions.db
: model predictions for a test dataset
held out of the model fitting. These predictions are used for
calculating predictive performance metrics (PPMs) using the
ebirdst_ppms()
function.config.json
: run-specific parameters, mostly for
internal use, but also containing useful parameters for mapping the
abundance data. These parameters can be loaded with
load_fac_map_parameters()
.The vignette will cover the raster and range data.
The core raster data products are the weekly estimates of occurrence,
count, relative abundance, and percent of population. These are all
stored in the widely used GeoTIFF raster format, and we refer to them as
“weekly cubes” (e.g. the “weekly abundance cube”). All cubes have 52
weeks and cover the entire globe, even for species with ranges only
covering a small region. They come with areas of predicted and assumed
zeroes, such that any cells that are NA
represent areas
where we didn’t produce model estimates.
All estimates are the median expected value for a 1km, 1 hour eBird Traveling Count by an expert eBird observer at the optimal time of day and for optimal weather conditions to observe the given species.
occurrence
: the expected
probability of encountering a species.count
: the expected count of a
species, conditional on its occurrence at the given location.abundance
: the
expected relative abundance of a species, computed as the product of the
probability of occurrence and the count conditional on occurrence. In
addition to the median relative abundance, upper and lower confidence
intervals (CIs) are provided, defined at the 10th and 90th quantile of
relative abundance, respectively.precent-population
: the percent of the total relative
abundance within each cell. This is a derived product calculated by
dividing each cell value in the relative abundance raster by the sum of
all cell valuesAll predictions are made on a standard 2.96 x 2.96 km global grid, however, for convenience lower resolution GeoTIFFs are also provided, which are typically much faster to work with. However, note that to keep file sizes small, the example dataset only contains low resolution data. The three resolutions are:
hr
): the native 2.96 km resolution
datamr
): the hr
data
aggregated by a factor of 3 in each direction resulting in a resolution
of 8.89 kmlr
): the hr
data
aggregated by a factor of 9 in each direction resulting in a resolution
of 26.7 kmThe weekly cubes use the following naming convention:
weekly/<species_code>_<product>_<metric>_<resolution>_<year>.tif
where metric
is typically median
, except
for the relative abundance CIs, which use lower
and
upper
. The function load_raster()
is used to
load these data into R and takes arguments for product
,
metric
and resolution
. For example,
# weekly, low res, median occurrence
<- load_raster(path, product = "occurrence", resolution = "lr")
occ_lr
occ_lr# use parse_raster_dates() to get the date associated which each raster layer
parse_raster_dates(occ_lr)
# weekly, low res, abundance confidence intervals
<- load_raster(path, product = "abundance", metric = "lower",
abd_lower resolution = "lr")
<- load_raster(path, product = "abundance", metric = "upper",
abd_upper resolution = "lr")
The GeoTIFFs use the same Sinusoidal projection as NASA MODIS data. This projection is ideal for analysis, as it is an equal are projection, but is not ideal for mapping since it introduces significant distortion.
The seasonal raster estimates are provided for the same set of
products and at the same three resolutions as the weekly estimates.
They’re derived from the weekly data by taking the cell-wise mean or max
across the weeks within each season. The seasonal boundary dates are
defined through a process of expert review of each species, and are
available in the data frame ebirdst_runs
. Each season is
also given a quality score from 0 (fail) to 3 (high quality), and
seasons with a score of 0 are not provided.
The seasonal GeoTIFFs use the following naming convention:
seasonal/<species_code>_<product>_seasonal_<metric>_<resolution>_<year>.tif
where metric
is either mean
or
max
. The function
load_raster(period = "seasonal")
is used to load these data
into R and takes arguments for product
, metric
and resolution
. For example,
# seasonal, low res, mean relative abundance
<- load_raster(path, product = "abundance",
abd_seasonal_mean period = "seasonal", metric = "mean",
resolution = "lr")
# season that each layer corresponds to
names(abd_seasonal_mean)
# just the breeding season layer
"breeding"]]
abd_seasonal_mean[[
# seasonal, low res, max occurrence
<- load_raster(path, product = "occurrence",
occ_seasonal_max period = "seasonal", metric = "max",
resolution = "lr")
Finally, as a convenience, the data products include year-round
rasters summarizing the mean or max across all weeks that fall within a
season that passed the expert review process. These can be accessed
similarly to the seasonal products, just with
period = "full-year"
instead. For example, these can layers
be used in conservation planning to assess the most important sites
across the full range and full annual cycle of a species.
# full year, low res, maximum relative abundance
abd_fy_max <- load_raster(path, product = "abundance",
period = "full-year", metric = "max",
resolution = "lr")
Seasonal range polygons are defined as the boundaries of non-zero
seasonal relative abundance estimates, which are then (optionally)
smoothed to produce more aesthetically pleasing polygons using the
smoothr
package. They are provided in the widely used
GeoPackage format, with file naming convention:
ranges/<species_code>_range_seasonal_<raw/smooth>_<resolution>_<year>.gpkg
where raw
refers to the polygons derived directly from
the raster data and smooth
refers to the smoothed polygons.
Note that only low and medium resolution ranges are provided. These
range polygons can be loaded with load_ranges()
:
# seasonal, low res, smoothed ranges
<- load_ranges(path, resolution = "lr")
ranges
ranges
# subset to just the breeding season range using dplyr
<- filter(ranges, season == "breeding") range_breeding
The two SQLite database contained within the species data package,
provide tabular data with information about modeled relationships
between observations and the ecological covariates, as well as data that
can be used to assess predictive performance. Note that these SQLite
databases are quite large (many GBs in size) and are therefore note
downloaded by default. To access these tabular data, you must use the
parameter tifs_only = FALSE
in the
ebirdst_download()
function.
Fink, D., T. Auer, A. Johnston, V. Ruiz‐Gutierrez, W.M. Hochachka, S. Kelling. 2019. Modeling avian full annual cycle distribution and population trends with citizen science data. Ecological Applications, 00(00):e02056. https://doi.org/10.1002/eap.2056