Introduction
This vignette provides an overview of assay registration process for
the tcpl package. The definition of an
“assay” is, for the purposes of this package, broken into:
assay_source – the vendor/origination of the
data
assay – the procedure to generate
the component data
assay_component – the
raw data readout(s)
assay_component_endpoint – the normalized
component data
Hierarchical Structure
Assay source, assay, assay component, and assay endpoint are registered via tcpl scripting into a collection of tables. These assay element tables broadly describe who conducted the assay, what platform was used, what was being measured (raw readout), and how the measurement was interpreted (normalized component data). A hierarchical structure of the assay elements is as follows: assay source > assay > assay component > assay component endpoint.
As one moves down the hierarchy, each additional level has a ‘one-to-many’ relationship with the previous level. For example, an assay component can have multiple assay endpoints, but an assay endpoint can derive only from a single assay component. From a database v3.4 snapshot taken in December 2020, InvitroDB supported 32 assay sources, 763 assays, 2074 components, and 2780 endpoints.
Minimum Required Fields
Throughout the tcpl R package, the levels of assay hierarchy are defined and referenced by their auto-incremented primary keys in the tcpl database: \(\mathit{asid}\) (assay source ID), \(\mathit{aid}\) (assay ID), \(\mathit{acid}\) (assay component ID), and \(\mathit{aeid}\) (assay endpoint ID). These abbreviations mirror the abbreviations for the identifiers (ids) with “nm” in place of “id” in the abbreviations, e.g. assay_component_name is abbreviated \(\mathit{acnm}\).
The tcpl R package provides three functions for adding new data:
tcplRegister – to register a new assay
element or chemical
tcplUpdate – to
change or add additional information for existing assay or chemical ids
tcplWriteLvl0 – to load formatted source
data
All processing occurs by assay component or assay endpoint, depending on the processing type (single-concentration or multiple-concentration) and level. No data is stored at the assay or assay source level. The “assay” and “assay_source” tables store annotations to help in the processing and down-stream understanding of the data. Additional details for registering each assay element and updating annotations are provided below. In addition to each assay element’s id, the minimal registration fields in order to ‘pipeline’ are: \(\mathit{assay\_source\_name}\) (\(\mathit{asnm}\)), \(\mathit{assay\_name}\) (\(\mathit{anm}\)), \(\mathit{assay\_footprint}\), \(\mathit{assay\_component\_name}\) (\(\mathit{acnm}\)), \(\mathit{assay\_component\_endpoint\_name}\) (\(\mathit{aenm}\)), and \(\mathit{normalized\_data\_type}\).
Naming Conventions
Assay Source
Assay source refers to the vendor or origination of the data. To register an assay source, an unused \(\mathit{asid}\) must be selected to prevent overwriting of existing data. When adding a new assay source, this should be an abbreviation, as subsequent levels will build on this assay source name.
#tcplLoadAsid()
#tcplRegister(what = "asid", flds = list(asid = 1, asnm = "Tox21"))
The tcplRegister function takes the abbreviation for \(\mathit{assay\_source\_name}\), but the function will also take the unabbreviated form. The same is true of the tcplLoadA- functions, which load the information for the assay annotations stored in the database.
Assay
Assay refers to the procedure, conducted by some vendor, to generate the component data. To register an assay, an \(\mathit{asid}\) must be provided to map the assay to the correct assay source. One source may have many assays. To ensure consistency of the naming convention, first check how other registered assays within the assay source were conducted and named. The assay names follow an abbreviated and flexible naming convention of Source_Assay. Notable assay design features to describe the assay include:
- Technology (i.e., detection technology),
- Format (e.g., organism, tissue, cell short name, or cell free component source name),
- Target (i.e., intended target, intended target family, gene), or
- Objective aspects (e.g., timepoint or assay footprint).
The most distinguishing features will be selected to create a succinct assay name. Variation depends on the assay itself as well as other assays provided by the vendor. If multiple features are needed to describe an assay, order will be based on relative importance in describing the assay and the assay’s relation to other assays provided by the vendor to limit confusion. “Source_Technology_Format_Target” is a commonly used naming order. However, if one target is screened on different assay platforms by the vendor, “Source_Target_Technology” is a more appropriate naming convention. This is the case for the Tox21 assays. Additional features may be relevant, including agonist or antagonist mode, or “Follow-up” if the assay is a secondary specificity assay. Conversely, some assays utilize a cell-based format to screen a functional profile of targets. These assays follow a naming convention, Source_Format, where specific target information is defined at the component and endpoint level. Bioseek and Attagene are sources that provide cell-based assays. Considering the diversity of the data sources and high throughput assays in ToxCast, a flexible naming approach is best used in conjunction with subject matter expert discretion.
#tcplLoadAid(what = "asid", val=1)
#tcplRegister(what = "aid",
# flds = list(asid = 1,
# anm = "TOX21_ERa_BLA_Agonist",
# assay_footprint = "1536 well"))
When registering an assay (\(\mathit{aid}\)), the user must give an \(\mathit{asid}\) to map the assay to the correct assay source. Registering an assay, in addition to an assay_name (\(\mathit{anm}\)) and \(\mathit{asid}\), requires \(\mathit{assay\_footprint}\). The \(\mathit{assay\_footprint}\) field is used in the assay plate visualization functions (discussed later) to define the appropriate plate size. The \(\mathit{assay\_footprint}\) field can take most string values, but only the numeric value will be extracted, e.g. the text string “hello 384” would indicate to draw a 384-well microtitier plate. Values containing multiple numeric values in \(\mathit{assay\_footprint}\) may cause errors in plotting plate diagrams.
Assay Component Registration
Assay component, or “component” for short, describes the raw data readouts. Like the previous level, one assay may have many components. To register an assay component and create an \(\mathit{acid}\), an \(\mathit{aid}\) must be provided to map the component to the correct assay. The assay component name will build on its respective assay name, to describe the specific feature being measured in each component. If there is only one component, the component name can be the same as the assay name. If there are multiple components measured in an assay, understanding the differences, and how one component may relate to another within an assay, are important naming considerations to prevent confusion. Assay component names will usually follow the naming convention of Source_Assay_Component, where “Component” is a brief description of what is being measured.
#tcplLoadAcid(what = "asid", val=1, add.fld=c("aid", "anm"))
#tcplRegister(what = "acid",
# flds = list(aid = 1,
# acnm = "TOX21_ERa_BLA_Agonist_ratio"))
The final piece of assay information needed is the assay component source name (abbreviated \(\mathit{acsn}\)), stored in the “assay_component_map” table. The assay component source name is intended to simplify level 0 pre-processing by defining unique character strings (concatenating information if necessary) from the source files that identify the specific assay components. An assay component can have multiple \(\mathit{acsn}\) values, but an \(\mathit{acsn}\) must be unique to one assay component. Assay components can have multiple \(\mathit{acsn}\) values to minimize the amount of data manipulation required (and therefore potential errors) during the level 0 pre-processing if assay source files change or are inconsistent. The unique character strings (\(\mathit{acsn}\)) get mapped to \(\mathit{acid}\).
#tcplRegister(what = "acsn", flds = list(acid = 1, acsn = "TCPL-MC-Demo"))
Assay Component Endpoint Registration
Assay component endpoint, or “endpoint” for short, represents the normalized component data. To register an endpoint and create an \(\mathit{aeid}\), an \(\mathit{acid}\) must be provided to map the endpoint to the correct component. In past tcpl versions, each component could have up to two endpoints therefore endpoint names would express directionality (*_up/_down*). tcpl_v3 allows bidirectional fitting to capture both the gain and loss of signal. Therefore for tcpl_v3 onward, the endpoint name will usually be the same as the component name.
#tcplLoadAeid(fld="asid",val=1, add.fld = c("aid", "anm", "acid", "acnm"))
#tcplRegister(what = "aeid",
# flds = list(acid = 1,
# aenm = "TOX21_ERa_BLA_Agonist_ratio",
# normalized_data_type = "percent_activity",
# export_ready = 1,
# burst_assay = 0,
# fit_all = 0))
Notice registering an assay endpoint also requires the \(\mathit{normalized\_data\_type}\) field. The \(\mathit{normalized\_data\_type}\) field gives some default values for plotting. Currently, the package supports three \(\mathit{normalized\_data\_type}\) values: (1) percent_activity, (2) log2_fold_induction, and (3) log10_fold_induction. Any other values will be treated as “percent_activity.”
The other three additional fields when registering an assay endpoint do not have to be explicitly defined when working in the MySQL environment and will default to the values given above. All three fields represent Boolean values (1 or 0, 1 being TRUE ). The \(\mathit{export\_ready}\) field indicates (1) the data is done and ready for export or (0) still in progress. The \(\mathit{burst\_assay}\) field is specific to multiple-concentration processing and indicates (1) the assay endpoint is included in the burst distribution calculation or (0) not. The \(\mathit{fit\_all}\) field is specific to multiple-concentration processing and indicates (1) the package should try to fit every concentration series, or (0) only attempt to fit concentration series that show evidence of activity.
Naming Revision
There are circumstances where assay, assay component, and assay endpoint names change. The \(\mathit{aid}\), \(\mathit{acid}\), and \(\mathit{aeid}\) are considered more stable in the database, and these auto-incremented keys should not change. To revise naming for assay elements, the correct id must be specified in the tcplUpdate statement to prevent overwriting data.
# tcplUpdate(what = "acid",
# flds = list(aid = 1,
# acnm = "TOX21_ERa_BLA_Agonist_ratio"))
Reasons for name changes could include:
- Feedback from subject matter experts or assay data generators;
- Clarifications on cell line and cell line drift;
- Addition of new assay data that makes the old naming convention insufficient, such as antagonist assays run with different concentrations of an agonist; or
- Laboratory or Center reorganizations.
Thus, users should be advised that while assay naming is used to infer information about the biology of the assay, assay naming will change over time to reflect progress in building ToxCast as a data resource.