The intention of this vignette is to provide a tutorial on how to
prepare the metadata information of a dataset in order to use the
DQAstats R package for data quality assessment. The R package dataquieR
provides both a dataset and corresponding metadata information. For
demonstration purposes, the SHIP dataset from dataquieR will be analyzed
with DQAstats in the following. The necessary steps to convert
dataquieR’s metadata representation into the format understood by
DQAstats are provided along with some background information.
dataquieR R package

First of all, load the data files from the dataquieR R package. These
files include the actual dataset to be analyzed, ship.RDS, and the
corresponding metadata representation, ship_meta.RDS. For further
information on dataquieR, please refer to
https://dataquality.ship-med.uni-greifswald.de/.
# the next lines are commented out, since we avoided adding 'dataquieR' as yet
# another dependency to our package; however, if you have installed it, you
# can uncomment these lines to load the required files from the dataquieR-package
# list.files(system.file("extdata", package = "dataquieR"))

# load the ship dataset
# ship_data <- readRDS(
#   system.file("extdata/ship.RDS", package = "dataquieR")
# ) %>%
#   data.table::data.table()
ship_data <- readRDS(
  url(paste0(
    "https://raw.githubusercontent.com/cran/",
    "dataquieR/master/inst/extdata/ship.RDS"
  ))
) %>%
  data.table::data.table()

# export the dataset to a CSV-file (one of the formats supported by DQAstats)
ship_data_export_fn <- "ship_data.csv"
data.table::fwrite(
  x = ship_data,
  file = file.path(tempdir(), ship_data_export_fn)
)

# load the ship metadata
# ship_meta <- readRDS(
#   system.file("extdata/ship_meta.RDS", package = "dataquieR")
# ) %>%
#   data.table::data.table()
ship_meta <- readRDS(
  url(paste0(
    "https://raw.githubusercontent.com/cran/",
    "dataquieR/master/inst/extdata/ship_meta.RDS"
  ))
) %>%
  data.table::data.table()
Have a look at dataquieR’s metadata representation:

ship_meta %>%
  DT::datatable(options = list(
    scrollX = TRUE,
    pageLength = 4
  ))
DQAstats MDR

The most important part for using DQAstats for data quality assessment
is to represent the metadata of the dataset(s) to be analyzed in the
correct format. The MDR format expected by DQAstats is described in the
DQAstats-wiki. An example MDR is also provided with the DQAstats R
package here. Furthermore, DQAstats comes with an exemplary toy dataset,
which can be analyzed by executing the commands from the README.md.
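To get a first impression of these bundled resources without leaving R,
the demo files shipped with the package can be listed directly (a small
convenience sketch; the demo_data folder is the one used later in this
vignette):

# list the demo files bundled with DQAstats (example MDR, toy dataset,
# report templates)
list.files(
  system.file("demo_data", package = "DQAstats"),
  recursive = TRUE
)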
DQAstats MDR as template

In order to get an empty template of the DQAstats MDR to be filled with
the metadata information from dataquieR, we will just read the column
names from the example MDR:

mdr <- data.table::fread(
  file = system.file(
    "demo_data/utilities/MDR/mdr_example_data.csv",
    package = "DQAstats"
  ),
  header = TRUE,
  nrows = 0,
  colClasses = "character"
)

dim(mdr)
#> [1] 0 15
colnames(mdr)
#> [1] "designation" "source_variable_name" "filter"
#> [4] "source_table_name" "source_system_name" "source_system_type"
#> [7] "key" "variable_name" "variable_type"
#> [10] "restricting_date_var" "constraints" "dqa_assessment"
#> [13] "definition" "data_map" "plausibility_relation"
For a detailed description of the MDR fields, please refer to the DQAstats-wiki.
DQAstats representation

Next, we need to fill the rows of the DQAstats MDR. Since we are only
analyzing one dataset (as opposed to comparing datasets from two
different databases, where the dataset has been transferred from one
database to the other, e.g., via an extract-transform-load (ETL)
process), we can set the MDR fields “designation”, “variable_name”, and
“key” to the same values.
# get names of dataelements from ship dataset
ship_var_names <- colnames(ship_data)

# loop over dataelements
for (var in ship_var_names) {
  # get variable type from ship metadata
  data_type <- ship_meta[get("VAR_NAMES") == var, get("DATA_TYPE")]

  # get variable description from ship metadata
  labels <- ship_meta[get("VAR_NAMES") == var, get("VALUE_LABELS")]
  split_list <- strsplit(
    x = labels,
    split = " | ",
    fixed = TRUE
  )

  # convert variable description into human readable format
  if (length(split_list[[1]]) == 1 && is.na(split_list[[1]])) {
    descr <- paste0("Description for dataelement '", var, "'")
  } else {
    descr <- paste0(
      "Description for dataelement '", var,
      "': \n\n", paste0(unlist(split_list), collapse = "; ")
    )
  }

  # for this dataelement, fill row of DQAstats-MDR with extracted information
  mdr <- data.table::rbindlist(
    l = list(
      mdr,
      data.table::as.data.table(
        x = cbind(
          "designation" = var,
          "definition" = descr,
          "source_variable_name" = var,
          "variable_name" = var,
          "key" = var,
          "source_table_name" = ship_data_export_fn,
          "source_system_type" = "csv", # indicates that the dataset's format is CSV
          "source_system_name" = "ship", # arbitrary name for this dataset
          "dqa_assessment" = 1, # default value for dataelements to be analyzed
          "variable_type" = data_type
        )
      )
    ),
    fill = TRUE
  )
}
In dataquieR, categorical variables are defined as “DATA_TYPE=integer”
in the MDR. To work well with DQAstats, this needs to be changed to
“enumerated” (see also the DQAstats-wiki for further details).
# save all categorical variables in a vector
cat_vars <- c(
  "id", "sex", "obs_bp", "dev_bp", "obs_soma", "dev_length", "dev_weight",
  "obs_int", "school", "family", "smoking", "stroke", "myocard",
  "diab_known", "contraception", "income"
)

# change variable type from integer to "enumerated" (to get meaningful
# results from DQAstats)
mdr[get("designation") %in% cat_vars, ("variable_type") := "enumerated"]
Next, the formatting of the constraints defined in dataquieR’s MDR needs
to be transformed into the formatting required by DQAstats. Please find
further details on the definition of constraints in the DQAstats-wiki.
# loop over categorical variables
for (var in cat_vars) {
  # get definitions of category-levels
  labels <- ship_meta[get("VAR_NAMES") == var, get("VALUE_LABELS")]
  split_list <- strsplit(
    x = labels,
    split = " | ",
    fixed = TRUE
  )

  # extract the allowed values for each categorical dataelement
  allowed_values <- strsplit(
    x = split_list[[1]],
    split = " =",
    fixed = TRUE
  )

  # transform allowed values into a "value_set" array, which is stored
  # as a JSON object (required by DQAstats)
  c <- jsonlite::toJSON(
    list(
      "value_set" = sapply(
        X = allowed_values,
        FUN = function(x) {
          return(x[[1]])
        }
      )
    ),
    pretty = TRUE
  )

  # add constraints for this dataelement to the DQAstats MDR
  mdr[get("source_variable_name") == var, ("constraints") := c]
}
# for comparison, show metadata definition from dataquieR ...
# dataquieR: string, split by pipes '|'
ship_meta[get("VAR_NAMES") == "smoking", get("VALUE_LABELS")]
#> [1] "0 = nonsmoker | 1 = former smoker | 2 = smoker"

# ... and DQAstats: JSON
mdr[get("source_variable_name") == "smoking", get("constraints")]
#> [1] "{\n  \"value_set\": [\"0\", \"1\", \"2\"]\n}"

mdr[get("source_variable_name") == "smoking", get("constraints")] %>%
  jsonlite::fromJSON()
#> $value_set
#> [1] "0" "1" "2"
We can see here that the mapping from the data codes to the respective
values gets lost during the transformation of the constraints from
dataquieR to DQAstats. This is because DQAstats is meant to be applied
directly to the raw dataset and checks the conformance of the observed
data values with the respective constraints for each variable.
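As an illustration of this idea, a value_set constraint can be checked
against the raw data column by hand (a sketch only; it mimics, but is
not, the internal DQAstats check):

# sketch: compare the observed values of 'smoking' with the value_set
# constraint stored in the MDR above
allowed <- jsonlite::fromJSON(
  mdr[get("source_variable_name") == "smoking", get("constraints")]
)$value_set
all(na.omit(as.character(ship_data$smoking)) %in% allowed)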
Furthermore, it should be noted that DQAstats does not require
constraints to be defined for each data element in order to be applied
to a dataset. Thus, data quality assessment with DQAstats can also be
performed without the definition of constraints.
# get names of continuous variables (float/integer)
cont_vars <- mdr[
  get("variable_type") %in% c("integer", "float"),
  unique(get("source_variable_name"))
]

# loop over continuous dataelements
for (var in cont_vars) {
  # extract constraints of continuous dataelement
  labels <- ship_meta[get("VAR_NAMES") == var, get("HARD_LIMITS")]
  split_list <- strsplit(
    x = labels,
    split = ";",
    fixed = TRUE
  )

  # transform these constraints to a list, with entries "min" and "max"
  constraints <- lapply(
    X = split_list,
    FUN = function(x) {
      if (!is.na(x[[1]])) {
        return(
          list(
            "min" = as.numeric(gsub("[[:punct:]]", "", x[[1]])),
            "max" = as.numeric(gsub("[[:punct:]]", "", x[[2]]))
          )
        )
      } else {
        return("error")
      }
    }
  )

  # save list as nested JSON within the key "range" (required by DQAstats)
  if (length(constraints[[1]]) > 1) {
    c <- jsonlite::toJSON(
      list("range" = constraints[[1]]),
      pretty = TRUE,
      auto_unbox = TRUE
    )
    # write constraint for this dataelement to DQAstats MDR
    mdr[get("source_variable_name") == var, ("constraints") := c]
  }
}
When setting “id” to variable_type="string", we are able to test for
constraints using regular expressions. This is especially meaningful
when investigating a categorical variable with a lot of categories.
Here, a simple regular expression is used; however, regular expressions
can get arbitrarily complex.
# The `id` variable should actually not be treated as a categorical variable,
# since the number of potential categories is (theoretically) unlimited and
# an analysis of the categories in this case is seldom meaningful.
# Nevertheless, we could perform quality checks on the `id` variable, e.g. to
# check for a required formatting of the values.
# This can be achieved with DQAstats by changing the variable type to "string"
# and using a regular expression that checks for the expected format.

# change variable type to "string"
mdr[get("source_variable_name") == "id", ("variable_type") := "string"]

# add regular expression for the dataelement "id"
mdr[
  get("source_variable_name") == "id",
  ("constraints") := jsonlite::toJSON(
    list("regex" = "^[[:digit:]]{1,5}$"),
    pretty = TRUE,
    auto_unbox = TRUE
  )
]
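For illustration (a sketch with made-up example values, not part of the
MDR preparation), this regular expression accepts identifiers that
consist of one to five digits only:

# sketch: which (hypothetical) id values satisfy the regex constraint?
grepl("^[[:digit:]]{1,5}$", x = c("12345", "123456", "A123"))
#> [1]  TRUE FALSE FALSE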
Currently, there is no mechanism to automatically transform the
plausibility checks (or “contradictions”, as they are called in
dataquieR) from the dataquieR format to DQAstats. The following checks
were manually taken from the file ship_contradiction_checks.csv and
transformed into the representation required by DQAstats. Please find
further details on the definition of plausibility checks in the
DQAstats-wiki.
# add plausibilities

# sex and contraception
p <- jsonlite::toJSON(
  list(
    "atemporal" = list(
      "contraception" = list(
        "name" = "A_present_and_B_vv",
        "description" = "Contracept in males.",
        "join_crit" = "id",
        "filter" = list(
          "ship" = "^1$"
        ),
        "constraints" = list(
          "value_set" = list(
            "ship" = "1"
          )
        )
      )
    )
  ),
  pretty = TRUE,
  auto_unbox = TRUE
)

# write plausibility to mdr
mdr[
  source_variable_name == "sex" &
    dqa_assessment == 1,
  plausibility_relation := p
]

p <- jsonlite::toJSON(
  list(
    "atemporal" = list(
      "diab_age" = list(
        "name" = "A_present_and_B_levels_vl",
        "description" = "Diab age but no diab.",
        "join_crit" = "id",
        "filter" = list(
          "ship" = "^(?!.*(99900|99901|99801)).*$"
        ),
        "constraints" = list(
          "value_set" = list(
            "ship" = "1"
          )
        )
      )
    )
  ),
  pretty = TRUE,
  auto_unbox = TRUE
)

mdr[
  source_variable_name == "diab_known" &
    dqa_assessment == 1,
  plausibility_relation := p
]

p <- jsonlite::toJSON(
  list(
    "uniqueness" = list(
      "sex" = list(
        "name" = "Every ID is associated with one Sex",
        "description" = paste0(
          "With each distinct value of 'id', ",
          "only one value of 'sex' may be associated."
        )
      )
    )
  ),
  pretty = TRUE,
  auto_unbox = TRUE
)

mdr[
  source_variable_name == "id" &
    dqa_assessment == 1,
  plausibility_relation := p
]
Now, have a look at the MDR prepared for DQAstats:

mdr %>%
  DT::datatable(options = list(
    scrollX = TRUE,
    pageLength = 4
  ))
utils-folder

DQAstats requires a so-called utilities-folder that contains at least
two sub-directories, named MDR and RMD. Please find further details on
this folder in the DQAstats-wiki.
Once this folder structure has been created, the MDR prepared for
DQAstats can be saved as a CSV file to the MDR sub-directory. This file
is then used as input for the function DQAstats::dqa().
utils_path <- file.path(tempdir(), "utilities")
dir.create(file.path(utils_path, "MDR"), recursive = TRUE)

data.table::fwrite(
  x = mdr,
  file = file.path(utils_path, "MDR/mdr.csv")
)
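At this point, the utilities-folder contains only the MDR sub-directory
with the exported metadata file (a quick check as a sketch; the RMD
templates are added in the next step):

# sketch: verify the folder structure created so far
list.files(utils_path, recursive = TRUE)
#> [1] "MDR/mdr.csv"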
The RMD sub-directory of the utilities-folder contains the template
files for the R Markdown report. For demonstration purposes, the
templates provided with DQAstats can be used directly.
file.copy(
from = system.file("demo_data/utilities/RMD", package = "DQAstats"),
to = utils_path,
overwrite = TRUE,
recursive = TRUE
)
DQAstats::dqa()

Since we are not aiming to compare two datasets here, we are going to
set the target_system_name to the source_system_name. Thus, we will get
an analysis of the database compared with itself.

Furthermore, the path to the utilities-folder, the logging directory,
the MDR filename, as well as the directory to store the resulting PDF
report (output_dir) need to be defined and provided to the function
DQAstats::dqa().

Additionally, the directory that contains the dataset(s) to be analyzed
must be specified using an environment variable. This variable is
composed of the name of the source system (as specified in the mdr.csv
field source_system_name, in upper case) and the suffix _PATH, here
SHIP_PATH.
# ship data set
source_system_name <- "ship"
target_system_name <- source_system_name

mdr_filename <- "mdr.csv"
output_dir <- file.path(tempdir(), "output")
logfile_dir <- tempdir()

# "SHIP_PATH" must point to the directory that contains "ship_data.csv";
# here this is tempdir(), where the CSV file was exported above
Sys.setenv("SHIP_PATH" = tempdir())

# provide all arguments to main function
all_results <- DQAstats::dqa(
  source_system_name = source_system_name,
  target_system_name = target_system_name,
  utils_path = utils_path,
  mdr_filename = mdr_filename,
  output_dir = output_dir,
  logfile_dir = logfile_dir,
  parallel = FALSE
)
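The function returns a list with all computed results, in addition to
the PDF report written to output_dir. To get a quick overview of what
was returned (a sketch; the exact structure of the result list is
described in the DQAstats documentation):

# sketch: inspect the top level of the returned results list
str(all_results, max.level = 1)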
DQAgui as GUI-frontend to DQAstats

When executing the following command, the GUI will be launched at port
3838 by default and can be accessed from any web browser via
http://localhost:3838. A tutorial for DQAgui can be found here.
library(DQAgui)
# set basepath for file-browser in GUI-config
Sys.setenv("CSV_SOURCE_BASEPATH" = tempdir())
## launch GUI
DQAgui::launch_app(
  utils_path = utils_path,
  mdr_filename = "mdr.csv",
  parallel = FALSE
)
# nolint end