Introduction

D4TAlink is a collection of tools, methods and processes for the management of analysis workflows. These lightweight solutions facilitate the structuring of R&D activities in compliance with FAIR and ALCOA principles.

D4TAlink.light is an R package with core functions to implement D4TAlink's processes.

  1. FAIR principles: Jacobsen et al., 2017 (doi:10.1162/dint_r_00024)
  2. ALCOA principles: Food & Drug Administration, 2018 (Data Integrity and Compliance With Drug CGMP - Questions and Answers Guidance for Industry).

Installation

Install from CRAN:

install.packages("D4TAlink.light")

Install the latest version from Bitbucket:

devtools::install_bitbucket("SQ4/d4talink.light",subdir="D4TAlink.light")
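Before installing, it can be useful to check whether the package is already available. The following is a minimal sketch using only base R; the variable names are illustrative.

```r
# Check whether D4TAlink.light is already installed, without attaching it
pkg <- "D4TAlink.light"
needs_install <- !requireNamespace(pkg, quietly = TRUE)

if (needs_install) {
  # Only report; the installation itself is done with one of the calls above
  message("Package '", pkg, "' is not installed yet.")
}
```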

Note that you may need to install:

Tutorial

Setting base parameters

Once the R package is loaded, the user must set D4TAlink's global parameters, namely the name of the data analyst and the name of the study sponsor.

library(D4TAlink.light)

setTaskAuthor("Doe Johns")
setTaskSponsor("mySponsor")

The location of the data file repository must then be defined, because D4TAlink manages data and information in flat files within a structured directory tree.

setTaskRoot(file.path(tempdir(),"D4TAlink_example001"),dirCreate=TRUE)

As described below, other parameters can be defined.

setTaskRmdTemplate("/SOME/WHERE/my.Rmd")
setTaskStructure(pathsDefault)

Note that D4TAlink's parameters can also be set via the .Renviron file located in the user's home directory.

D4TAlink_author="Dow Johns"
D4TAlink_sponsor="CompanyA"
D4TAlink_root="/SOME/WHERE/D4TAlink_example001"
D4TAlink_rmdtempl="/SOME/WHERE/my.Rmd"
D4TAlink_rscripttempl="/SOME/WHERE/my.R"
D4TAlink_pathgen="pathsDefault"
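To check which of these variables are visible to the current R session, one can query the environment with base R. This is a minimal sketch; it only inspects the variables and makes no assumption about how D4TAlink.light consumes them.

```r
# Names of the variables listed above
vars <- c("D4TAlink_author", "D4TAlink_sponsor", "D4TAlink_root",
          "D4TAlink_rmdtempl", "D4TAlink_rscripttempl", "D4TAlink_pathgen")

# Sys.getenv() returns "" for variables that are not set
values <- Sys.getenv(vars)
missing_vars <- names(values)[values == ""]

if (length(missing_vars) > 0) {
  message("Not set: ", paste(missing_vars, collapse = ", "))
}
```

Changes to .Renviron only take effect after R is restarted.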

Creating an analysis task

A data analysis workflow typically comprises a succession of distinct analysis tasks. A typical workflow would comprise the following tasks:

  1. exporting and loading of source data,
  2. data transformation (e.g., normalization and imputation),
  3. descriptive statistics, and
  4. statistical modelling.

Coding these successive tasks in a single analysis script is bad practice for several reasons. Firstly, the script becomes lengthy and thus difficult to write, review and maintain. Further, it prevents code reuse and hinders project agility. Finally, it complicates team collaboration on a single workflow.

D4TAlink defines the 'analysis task' as its central concept. A data analysis workflow consists of a succession of tasks, which may branch into a tree.

Each task is assigned to a work package, which is assigned to a project, and each project is assigned to a sponsor.

To create an analysis task in R, use the following calls.

# make sure that the sponsor was defined
setTaskSponsor("mySponsor")

# create a task
task <- initTask(project="myProject",
                 package="myPackage",
                 taskname=sprintf("%s_myTask",format(Sys.time(),"%Y%m%d")))
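As an illustration of splitting a workflow into tasks, the four example steps listed above could each get their own task. This is only a sketch: the step names are hypothetical, and it assumes the global parameters are set as in the previous section.

```r
library(D4TAlink.light)

# Global parameters, as in the tutorial above
setTaskAuthor("Doe Johns")
setTaskSponsor("mySponsor")
setTaskRoot(file.path(tempdir(), "D4TAlink_example001"), dirCreate = TRUE)

# Hypothetical step names, one per stage of the example workflow
steps <- c("loadData", "transformData", "describeData", "modelData")
prefix <- format(Sys.time(), "%Y%m%d")

# Create one task per step; all share the same project and work package
tasks <- lapply(steps, function(step) {
  initTask(project = "myProject",
           package = "myPackage",
           taskname = sprintf("%s_%s", prefix, step))
})
names(tasks) <- steps
```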

Accessing content of a task

Each task has its own directory structure. The task contains storage for five types of data:

  1. output data: typically the data produced by the scripts, in the form of Excel files, graphics files, …,
  2. source data: local storage for the input data provided by third parties,
  3. analysis scripts: data analysis scripts in R, SAS, Python, …,
  4. documentation: documentation of the analysis task,
  5. binary data: output data of the task, stored in binary format for use by follow-up tasks.

The location of these data can be obtained using respectively the functions reportDir, datasourceDir, progDir, docDir, and binaryDir.
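The five locations can be listed side by side for a given task. This sketch assumes that each of these functions accepts the task object as its first argument, as reportDir does elsewhere in this tutorial; the task name is illustrative.

```r
library(D4TAlink.light)

setTaskAuthor("Doe Johns")
setTaskSponsor("mySponsor")
setTaskRoot(file.path(tempdir(), "D4TAlink_example001"), dirCreate = TRUE)

task <- initTask(project = "myProject",
                 package = "myPackage",
                 taskname = "20220801_myTask")

# One entry per type of data, in the order listed above
dirs <- list(output        = reportDir(task),
             source        = datasourceDir(task),
             scripts       = progDir(task),
             documentation = docDir(task),
             binary        = binaryDir(task))
str(dirs)
```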

For traceability, files within a task are named using the format [TASK_NAME]_[DATA_TYPE].[EXTENSION], where DATA_TYPE is a short string describing the content of the file and EXTENSION is the file type (e.g., pdf or xlsx). By convention, TASK_NAME has a date prefix with format %Y%m%d_, and DATA_TYPE contains neither underscores (_) nor dots (.).
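The naming convention can be sketched with a small base R helper. The function task_file_name is hypothetical and not part of D4TAlink.light; the package's reporting functions build the file names themselves.

```r
# Hypothetical helper illustrating the naming convention
task_file_name <- function(task_name, data_type, extension) {
  # By convention, DATA_TYPE must not contain underscores or dots
  if (grepl("[_.]", data_type)) {
    stop("DATA_TYPE must not contain '_' or '.': ", data_type)
  }
  # By convention, TASK_NAME starts with a date formatted as %Y%m%d_
  if (!grepl("^[0-9]{8}_", task_name)) {
    warning("TASK_NAME does not start with a %Y%m%d_ date prefix: ", task_name)
  }
  sprintf("%s_%s.%s", task_name, data_type, extension)
}

task_file_name("20220801_myTask", "plots", "pdf")
# [1] "20220801_myTask_plots.pdf"
```

Because DATA_TYPE contains no underscore or dot, it can always be recovered unambiguously as the part between the last underscore and the last dot of the file name.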

Reporting graphics in a task

To output a graphic file in the output directory of the task, use the following.

PDF

file <- pdfReport(task,c("plots",1),dim=c(100,100))
hist(rnorm(100))
dev.off()
openPDF(file)

PNG

file <- pngReport(task,c("plots",1),dim=c(300,300))
hist(rnorm(100))
dev.off()
print(file)

JPEG

file <- jpegReport(task,c("plots",1),dim=c(300,300))
hist(rnorm(100))
dev.off()
print(file)

Reporting tables in a task

To output data frames as an Excel file in the output directory of the task, use the following:

d <- list(letters=data.frame(a=LETTERS,b=letters,c=seq_along(letters)),
          other=data.frame(a=1:3,b=11:13))
file <- saveReportXls(d,task,"tables")
print(file)

and to output in a text file:

file <- saveReportTable(d$letters,task,"tables")
print(file)
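Since saveReportTable is called above with a single data frame, a list of data frames can be written as one text file per element. This is a sketch that reuses the list names as data types; it assumes the names contain no underscores or dots, in line with the naming convention.

```r
library(D4TAlink.light)

setTaskAuthor("Doe Johns")
setTaskSponsor("mySponsor")
setTaskRoot(file.path(tempdir(), "D4TAlink_example001"), dirCreate = TRUE)

task <- initTask(project = "myProject",
                 package = "myPackage",
                 taskname = "20220801_myTask")

d <- list(letters = data.frame(a = LETTERS, b = letters, c = seq_along(letters)),
          other   = data.frame(a = 1:3, b = 11:13))

# Write each data frame to its own text file, using its list name as data type
files <- lapply(names(d), function(n) saveReportTable(d[[n]], task, n))
names(files) <- names(d)
print(files)
```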

Transferring data from one task to another

Since each task is one step in a stepwise process, data can be passed from one task to the next. To do so, the parent task stores R objects with the call saveBinary(object,task,"objectType"). The child task can then load the data from the parent task with the call readBinary(loadTask(...),"objectType"). Here is an example.

Saving data in a parent task:

d <- list(letters=data.frame(a=LETTERS,b=letters,c=seq_along(letters)),
          other=data.frame(a=1:3,b=11:13))
task <- initTask(project="myProject",
                 package="myPackage",
                 taskname="20220801_parentTask")
file <- saveBinary(d,task,"someData")
print(file)

Loading data in a child task:

task <- initTask(project="myProject",
                 package="myPackage",
                 taskname="20220801_childTask")
e <- readBinary(loadTask(task$project,task$package,"20220801_parentTask"),"someData")
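To confirm that the transfer worked, the object loaded in the child task can be compared with the one saved by the parent. The sketch below repeats both steps so that it runs on its own; it assumes that readBinary returns the object exactly as it was saved.

```r
library(D4TAlink.light)

setTaskAuthor("Doe Johns")
setTaskSponsor("mySponsor")
setTaskRoot(file.path(tempdir(), "D4TAlink_example001"), dirCreate = TRUE)

d <- list(letters = data.frame(a = LETTERS, b = letters, c = seq_along(letters)),
          other   = data.frame(a = 1:3, b = 11:13))

# Parent task: store the object
parent <- initTask(project = "myProject",
                   package = "myPackage",
                   taskname = "20220801_parentTask")
saveBinary(d, parent, "someData")

# Child task: load the object back from the parent task
child <- initTask(project = "myProject",
                  package = "myPackage",
                  taskname = "20220801_childTask")
e <- readBinary(loadTask(project = "myProject",
                         package = "myPackage",
                         taskname = "20220801_parentTask"),
                "someData")

# Compare the saved and the loaded objects
roundtrip_ok <- isTRUE(all.equal(d, e))
print(roundtrip_ok)
```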

Create and render task documentation

The documentation of a task is typically authored in R markdown files (Rmd). D4TAlink recommends one Rmd file per task. D4TAlink.light provides functions to create and render these files.

Creation of an R markdown file from template:

file <- initTaskRmd(task)
print(file)

Rendering of the markdown file into the task documentation directory:

file <- renderTaskRmd(task) # may require having run 'tinytex::install_tinytex()'
openPDF(file)

Create a task R script

For some tasks, an R script may also be needed. The task script can be created from the default template:

file <- initTaskRscript(task)
print(file)

Task archiving

D4TAlink has tools to archive and restore tasks. This enables, for instance, transferring tasks from one repository to another.

Archiving a task:

setTaskRoot(file.path(tempdir(),"D4TAlink_exampleFrom"),dirCreate=TRUE)
task <- initTask(project="myProject",
                 package="myPackage",
                 taskname="20220501_myTask")
file <- tempfile(fileext=".zip")
archiveTask(task,file)
print(reportDir(task))

Restoring a task into a different repository:

setTaskRoot(file.path(tempdir(),"D4TAlink_exampleTo"),dirCreate=TRUE)
restoreTask(file)
newtask <- loadTask(project="myProject",
                    package="myPackage",
                    taskname="20220501_myTask")
print(reportDir(newtask))

Changing R markdown and script templates

The R markdown and script templates can be set using the functions setTaskRmdTemplate and setTaskRscriptTemplate as follows.

setTaskRmdTemplate("/SOME/WHERE/my.Rmd")
setTaskRscriptTemplate("/SOME/WHERE/my.R")

Further, the paths to the templates can be set in the .Renviron file:

D4TAlink_rmdtempl="/SOME/WHERE/my.Rmd"
D4TAlink_rscripttempl="/SOME/WHERE/my.R"

Changing directory structure

The directory structure can be customized by passing a path-generating function to setTaskStructure, as follows.

fun <- function(project,package,taskname,sponsor) {
  basePath <- file.path("%ROOT%",sponsor,project,package)
  paths <- list(
    root = "%ROOT%",
    datasrc = file.path(basePath, "raw", "data_source"),
    data = file.path(basePath, "output","adhoc",taskname),
    bin  = file.path(basePath, "output","adhoc",taskname,"bin"),
    code = file.path(basePath, "progs"),
    doc  = file.path(basePath, "docs"),
    log  = file.path(basePath, "output","log")
  )
  paths  # return the list of paths explicitly
}

setTaskStructure(fun)
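A custom generator can also be called directly to inspect the paths it produces. The sketch below uses only base R. The substitution of %ROOT% is shown purely for illustration, on the assumption that the package replaces this placeholder with the repository root; the argument values are examples.

```r
# Same generator as above, returning the list explicitly
fun <- function(project, package, taskname, sponsor) {
  basePath <- file.path("%ROOT%", sponsor, project, package)
  list(
    root    = "%ROOT%",
    datasrc = file.path(basePath, "raw", "data_source"),
    data    = file.path(basePath, "output", "adhoc", taskname),
    bin     = file.path(basePath, "output", "adhoc", taskname, "bin"),
    code    = file.path(basePath, "progs"),
    doc     = file.path(basePath, "docs"),
    log     = file.path(basePath, "output", "log")
  )
}

paths <- fun("myProject", "myPackage", "20220801_myTask", "mySponsor")

# Illustration only: substitute a hypothetical root for the %ROOT% placeholder
root <- file.path(tempdir(), "D4TAlink_example001")
resolved <- lapply(paths, function(p) sub("%ROOT%", root, p, fixed = TRUE))
str(resolved)
```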

The built-in path generation functions are pathsDefault, pathsGLPG, and pathsPMS.

Further, the path generator can be selected by name in the .Renviron file:

D4TAlink_pathgen="pathsDefault"