The safedata R package is designed to discover and work with data using the formatting and indexing API developed for the Stability of Altered Forest Ecosystems (SAFE) Project. The SAFE Project is one of the largest ecological experiments in the world, investigating the effects of human activities on biodiversity and ecosystem function in the Malaysian rainforest.
Research conducted at the SAFE Project encompasses expertise from many disciplines and institutions, each running interlinked projects that help to develop our understanding of ecology within changing environments. Data from the many activities conducted through the SAFE Project are curated and published to a community repository hosted at Zenodo. The safedata package enables researchers to quickly and easily interface with these datasets.
All researchers working at the SAFE Project are required to submit their project data to the SAFE Zenodo repository. The publication process has three stages, described in more detail below.
Datasets are submitted as Microsoft Excel spreadsheets, containing the following worksheets:
The formatting details for each worksheet are described here: https://safedata-validator.readthedocs.io/en/latest/data_format/overview.html
Once a dataset has been formatted, researchers can submit the dataset to the SAFE Project website for validation and publication.
Submitted datasets are validated using the Python program safedata_validator (https://github.com/ImperialCollegeLondon/safedata_validator). This checks that the dataset format is correct and that all the required metadata is provided and consistent. When a dataset fails validation, a report is returned to the submitter to help revise the dataset; datasets that pass validation are published to Zenodo.
Zenodo is a scientific data repository backed by CERN. Zenodo provides data communities that allow all of the SAFE Project datasets to be collated into a single collection (https://zenodo.org/communities/safe/). The publication process uses the metadata provided in the submitted file to automatically create a detailed description of the dataset. Zenodo also issues DOIs for published datasets and provides versioning and access control.
Zenodo uses a DOI versioning system that allows sets of dataset records to be grouped. When a dataset is published on Zenodo for the first time, two DOIs are registered: a concept DOI for the dataset as a whole and a version DOI for that first version.
Subsequent versions of an upload are then logged under a new DOI. This means that multiple versions of a dataset can be stored and referenced clearly. The concept DOI refers to all versions of the dataset (and the DOI link resolves to the most recent version), while the version DOI refers to a particular instance.
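The relationship between concept and version DOIs can be illustrated with a small Python model. This is only a sketch of the behaviour described above, not Zenodo's implementation: the DatasetConcept class and its methods are hypothetical, and the second record id used in the usage note is made up for illustration. The DOI prefix 10.5281/zenodo. is the one Zenodo uses for its records.

```python
from dataclasses import dataclass, field

ZENODO_DOI_PREFIX = "10.5281/zenodo."  # prefix of DOIs issued by Zenodo


@dataclass
class DatasetConcept:
    """Hypothetical model of a dataset concept and its published versions."""

    concept_id: int
    version_ids: list = field(default_factory=list)  # record ids, oldest first

    def concept_doi(self) -> str:
        """DOI that refers to all versions of the dataset."""
        return f"{ZENODO_DOI_PREFIX}{self.concept_id}"

    def publish(self, record_id: int) -> str:
        """Log a new version and return its version DOI."""
        self.version_ids.append(record_id)
        return f"{ZENODO_DOI_PREFIX}{record_id}"

    def resolve(self, doi: str) -> int:
        """Return the record id that a DOI resolves to."""
        if doi == self.concept_doi():
            if not self.version_ids:
                raise LookupError("concept has no published versions")
            # The concept DOI resolves to the most recent version.
            return self.version_ids[-1]
        for record_id in self.version_ids:
            if doi == f"{ZENODO_DOI_PREFIX}{record_id}":
                # A version DOI always resolves to that exact record.
                return record_id
        raise LookupError(f"unknown DOI for this concept: {doi}")
```

For example, after `concept = DatasetConcept(1400561)` and `concept.publish(1400562)`, resolving the concept DOI gives record 1400562. Publishing a later version under a new (here invented) record id moves the concept DOI to that newer record, while the DOI for 1400562 continues to resolve to the original version.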
Datasets published to the SAFE Project community on Zenodo can have one of three access statuses: open, embargoed or restricted.
In practice, restricted data is little used and a fourth ‘Closed’ status is not accepted for SAFE datasets. Note that a particular dataset concept may have versions with a mixture of access statuses.
The Zenodo website provides the ability to search the text of record descriptions, including a search tool specifically for the SAFE Project community. However, these searches are unstructured and do not cover all of the metadata contained within published datasets. The SAFE Project website therefore maintains a search API that allows structured queries to be performed on the following:
dates: the start and end points of data collection within a dataset,
fields: the text and field type of fields within individual data tables within a dataset,
authors: the authors of datasets,
text: a free text search of dataset, worksheet and field titles and descriptions,
taxa: the taxa included in a dataset, and
spatial: matching datasets by sampling location.
In addition, the API provides a record endpoint that allows the full record metadata to be downloaded in JSON format.
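As a rough illustration of what a structured query looks like compared with a free text search, the Python sketch below builds a request URL for one of the search endpoints listed above. The endpoint names come from that list; everything else is an assumption made for the example: the base URL is a placeholder, and the parameter names are hypothetical and should be checked against the API documentation.

```python
from urllib.parse import urlencode

# Placeholder base URL (assumption): substitute the real address of the search API.
API_BASE = "https://example.org/api"

# Search endpoint names taken from the list above.
SEARCH_ENDPOINTS = {"dates", "fields", "authors", "text", "taxa", "spatial"}


def build_query_url(endpoint: str, **params: str) -> str:
    """Build a query URL for one search endpoint.

    The keyword arguments are hypothetical query parameters; they are
    sorted so that the resulting URL is deterministic.
    """
    if endpoint not in SEARCH_ENDPOINTS:
        raise ValueError(f"unknown search endpoint: {endpoint}")
    url = f"{API_BASE}/{endpoint}"
    query = urlencode(sorted(params.items()))
    return f"{url}?{query}" if query else url
```

Under these assumptions, `build_query_url("taxa", name="Formicidae")` targets the taxa endpoint with a single named parameter, which is the kind of field-specific query that the unstructured Zenodo text search cannot express.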
The safedata package stores downloaded datasets, record metadata and key index files within a data directory, which is used as a local repository of the datasets used by an individual researcher. The structure of the directory is critical to the operation of the package and the datasets themselves are under version control: users must not change the structure or the file contents of this directory.
The directory structure is as follows: three index files are stored in the root of the directory, which will also contain a folder named with the concept id number of each dataset that has been downloaded. Each concept folder will contain at least one subfolder named with the record number of a downloaded version of that dataset. Each of these subfolders will contain the record files downloaded from Zenodo and a JSON format file containing the dataset metadata. The root of the directory also contains a JSON format file that records the base URL of the dataset index website.
For example:
gazetteer.geojson
index.json
location_aliases.csv
url.json
1400561/1400562/1400562.json
1400561/1400562/Psomas_Ant_Pselaphine_SAFE_dataset.xlsx
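The layout above can be read mechanically. The following Python sketch classifies a relative path from the data directory as a root index file, a record metadata file or a downloaded data file. It is not part of the safedata package: the classify function and the labels it returns are hypothetical, and it assumes that the metadata file is always named after its record number, as in the example listing.

```python
from pathlib import PurePosixPath

# Files expected in the root of the data directory, as in the listing above.
ROOT_FILES = {"gazetteer.geojson", "index.json", "location_aliases.csv", "url.json"}


def classify(path: str) -> dict:
    """Classify a path given relative to the data directory root."""
    parts = PurePosixPath(path).parts
    if len(parts) == 1 and parts[0] in ROOT_FILES:
        return {"kind": "index", "name": parts[0]}
    if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
        concept_id, record_id, name = parts
        # Assumption: the record metadata file is named <record id>.json.
        kind = "metadata" if name == f"{record_id}.json" else "data"
        return {
            "kind": kind,
            "concept": int(concept_id),
            "record": int(record_id),
            "name": name,
        }
    raise ValueError(f"unexpected path in data directory: {path}")
```

The function only inspects path strings and never touches the files, in keeping with the requirement that the contents of the data directory are not modified by hand.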