There are several utilities in the R ecosystem for reproducible research. The package repana (for Reproducible Analysis) helps to maintain a common directory structure that separates three kinds of files: files that are part of the main production stream, files that are products of that stream, and modified files that are no longer part of the main stream but need to be kept, such as re-formatted reports or presentations.
The aspiration of this package is that you can set up an analysis with the make_structure() function, access the database with the get_con() function, document your database tables with the update_table() function, and reproduce a complete analysis by running the master() function.
The function make_structure() reads the config.yml file and ensures that all entries in the dirs section exist. The config.yml created by default produces the following directories for the data, functions, database, logs, reports and handmade entries:
_data to keep all data sources needed to reproduce the analysis
_functions to keep all functions programmed for the analysis
database to keep all secondary datasets and objects
logs to keep the log of the scripts
reports to keep all secondary analysis reports and sheets
handmade to keep all modified files and reports that should be kept
The directories can be used in your programs through config::get("dirs").
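To make this concrete, here is a minimal base-R sketch of what "ensuring that every dirs entry exists" amounts to. It is not the code of make_structure(): the dirs list is typed by hand to mirror the defaults, and ensure_dirs() and the temporary root directory are hypothetical names used only for illustration.

```r
# Hand-written copy of the default dirs entries (normally read from config.yml).
dirs <- list(
  data = "_data", functions = "_functions", database = "database",
  logs = "logs", reports = "reports", handmade = "handmade"
)

# Hypothetical helper: create each directory under `root` unless it exists.
ensure_dirs <- function(dirs, root = ".") {
  paths <- file.path(root, unlist(dirs))
  for (p in paths) {
    if (!dir.exists(p)) dir.create(p, recursive = TRUE)
  }
  invisible(paths)
}

root <- file.path(tempdir(), "repana_sketch")  # throwaway location
created <- ensure_dirs(dirs, root)
all(dir.exists(created))
```

Running the helper a second time changes nothing, which is the property that makes it safe to call at the start of every analysis.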
The contents of data, functions and handmade, as well as the scripts in the root directory, should be preserved because they are the core of your analysis. The files in logs, reports and database are created and recreated by your analysis scripts; they are the results of your analysis.
The function clean_structure() cleans the directories listed in clean_before_new_analysis so that an analysis can be re-run without mixing new and old results. If you use git, those directories are candidates to be excluded from version control by listing them in the .gitignore file. (make_structure() takes care of creating a .gitignore if it does not exist, or of adding those directories to it if they are not yet included.)
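The "create it or extend it" behaviour described for .gitignore can be sketched in base R as follows. This is an illustration under assumptions, not the package's implementation: add_to_gitignore() is a hypothetical helper, and the file lives in a temporary directory so the sketch does not touch a real repository.

```r
# Hypothetical helper: add entries to a .gitignore file without duplicating them.
add_to_gitignore <- function(entries, file = ".gitignore") {
  current <- if (file.exists(file)) readLines(file) else character(0)
  to_add <- setdiff(entries, current)
  if (length(to_add) > 0) writeLines(c(current, to_add), file)
  invisible(to_add)
}

gitignore <- file.path(tempdir(), "sketch.gitignore")
writeLines("logs", gitignore)  # simulate a file that already ignores logs

# Only the entries that are not yet present get appended.
add_to_gitignore(c("database", "reports", "logs"), gitignore)
ignored <- readLines(gitignore)
ignored
```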
The function make_structure() writes a config.yml if it does not exist. This file is read by the config::get() function. It contains the following entries under the default: tree:
dirs: defines the directories that make_structure() will maintain. It has entries for the data, functions, database, reports, logs and handmade directories. If you prefer values other than the defaults, you may change them. You may access those directories in your programs using config::get("dirs"). Note that the entry names data and functions do not have an underscore, but by default their values are _data and _functions respectively. make_structure() creates the paths from the values of the entries, but you access them in your program by the name of the entry. This gives you the freedom to point the real path of a directory to any place you need.
clean_before_new_analysis: defines the list of directories that should be cleaned every time you want to repeat the analysis from zero. These directories are included in the .gitignore.
defaultdb: is written with the parameters for an RSQLite SQLite driver.
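The separation between entry names and entry values described for dirs can be seen in a short sketch. It assumes the config package is installed; the config file is a throwaway temporary file, and the external path /mnt/shared/raw_data and the file name measurements.csv are made up for the example.

```r
library(config)

# Throwaway config file; the external data path is a made-up example.
cfg <- tempfile(fileext = ".yml")
writeLines(c(
  "default:",
  "  dirs:",
  "    data: /mnt/shared/raw_data",
  "    logs: logs"
), cfg)

dirs <- config::get("dirs", file = cfg)

# Scripts refer to the entry name `data`, whatever its value is.
input_file <- file.path(dirs$data, "measurements.csv")
input_file
```

Moving the data to another location then only requires editing the value in config.yml; the scripts keep using dirs$data.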
You may add other entries that your analysis requires. The config.yml itself should also be included in your .gitignore file, because it changes from system to system (e.g. driver parameters). The documentation of your analysis should therefore state which entries must be defined, so that the analysis can be reproduced on another machine.
DBI and pool connections are used to keep data as well as results in a database system. You must provide, at least, values for the package and dbconnect entries, corresponding to the package that provides the connection and the name of its connection function. Note that entry names are lowercase. The rest of the entries must correspond to parameters of your driver connection or pool connection.
Example using RSQLite with a results.db file in the database directory:
defaultdb:
  package: RSQLite
  dbconnect: SQLite
  dbname: database/results.db
Example using RPostgres:
defaultdb:
  package: RPostgres
  dbconnect: Postgres
  dbname: testdb
  host: localhost
  port: 5432
  user: username
  password: password
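One way such entries could be turned into a live connection is sketched below. This is an assumption about the general mechanism, not the code of get_con(): the dbconf list is typed by hand in place of config::get("defaultdb"), and an in-memory SQLite database is used so the sketch leaves no file behind. It needs the DBI and RSQLite packages.

```r
library(DBI)

# Hand-written stand-in for config::get("defaultdb").
dbconf <- list(package = "RSQLite", dbconnect = "SQLite", dbname = ":memory:")

# Look up the driver constructor named by the package/dbconnect entries.
driver_fun <- getExportedValue(dbconf$package, dbconf$dbconnect)

# Every remaining entry is passed on as a connection parameter.
params <- dbconf[setdiff(names(dbconf), c("package", "dbconnect"))]
con <- do.call(DBI::dbConnect, c(list(driver_fun()), params))

connection_ok <- DBI::dbIsValid(con)
DBI::dbDisconnect(con)
connection_ok
```

Because the driver is resolved by name, switching to another backend in this sketch only means changing the list entries, which mirrors the role the config.yml entries play.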
You can define several configurations to use different databases in the same analysis, but defaultdb is the one used by default by the update_table() function.
The function update_table() saves a data.frame into the database and keeps a log in the log_table table with the timestamp at which the table was updated in the database. The log_table records when the table was included in the database, together with a comment that helps to trace its origin. You may include the date the data was obtained or the source of the data.
update_table(p_con, "iris", "from system")
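The idea of pairing every write with a log entry can be illustrated with plain DBI calls. The sketch below is not the implementation of update_table(): save_with_log() is a hypothetical helper, and the log_table column names (table_name, updated_at, source) are assumptions chosen for the example. It needs the DBI and RSQLite packages.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical helper: write a table and record when and from where it came.
save_with_log <- function(con, name, data, source) {
  dbWriteTable(con, name, data, overwrite = TRUE)
  entry <- data.frame(
    table_name = name,
    updated_at = format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
    source = source,
    stringsAsFactors = FALSE
  )
  # Appending creates log_table on first use and adds a row afterwards.
  dbWriteTable(con, "log_table", entry, append = TRUE)
  invisible(entry)
}

save_with_log(con, "iris", iris, "from system")
log_rows <- dbReadTable(con, "log_table")
tables <- dbListTables(con)
dbDisconnect(con)
log_rows
```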
The master(pattern, start, stop, logdir, rscript_path) function executes, each in a plain vanilla R process, the files identified by pattern. The default pattern is "^[0-9][0-9].*\\.R$", which includes files like 01_read_data.R, 02_process_data.R, 03_analysis_data.R and 04_make_report.R, but not report1.rmd, exploratory.R, etc. Files are run in order, starting from the first, but if for any reason you need to omit the first files you may skip them with the start parameter.
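The effect of the default pattern can be checked with base R alone. The sketch applies the pattern to a hand-written vector of the file names mentioned above; it does not call master() and makes no claim about how the package lists files internally.

```r
pattern <- "^[0-9][0-9].*\\.R$"

files <- c("01_read_data.R", "02_process_data.R", "03_analysis_data.R",
           "04_make_report.R", "report1.rmd", "exploratory.R")

# Keep only the names that start with two digits and end in .R,
# then sort them so they would run in numeric order.
selected <- sort(files[grepl(pattern, files)])
selected
```

The two-digit prefix is what fixes the execution order, so exploratory scripts can live in the same directory without being picked up.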
logdir is the directory for the logs, by default config::get("dirs")$logs.
rscript_path is the full path to the Rscript program, which is ultimately the one that processes the R file. The current default is for a macOS system; implementations for Linux and Windows are planned.
The master() function uses functions from the processx package.
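As a rough sketch of what running one script in a vanilla R process looks like with processx, consider the following. It is an illustration, not the body of master(): the script and log file are created in a temporary directory, and locating Rscript through R.home("bin") is an assumption of the sketch rather than the package default.

```r
library(processx)

# Create a tiny numbered script in a throwaway directory.
script <- file.path(tempdir(), "01_hello.R")
writeLines('cat("hello from a clean session\\n")', script)

# Assumption: use the Rscript that belongs to the running R installation.
rscript <- file.path(R.home("bin"), "Rscript")

# --vanilla starts R without saved workspaces or user profiles.
res <- processx::run(rscript, c("--vanilla", script))

# Keep the output of the run as a log file next to the script.
logfile <- file.path(tempdir(), "01_hello.log")
writeLines(res$stdout, logfile)
res$status
```

Starting a fresh process per script means that objects left over from one step cannot silently leak into the next one.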
Here is an example of the config.yml created by make_structure():
default:
  dirs:
    data: _data
    functions: _functions
    handmade: handmade
    database: database
    reports: reports
    logs: logs
  clean_before_new_analysis:
    - database
    - reports
    - logs
  defaultdb:
    package: RSQLite
    dbconnect: SQLite
    dbname: ":memory:"
    driver: /usr/local/lib/libsqlite3odbc.dylib
    database: database/results.db
You may download the package from GitHub. Within R you may use devtools::install_github() as:
devtools::install_github("johnaponte/repana", build_manual = TRUE, build_vignettes = TRUE)