This vignette describes and explains the logic behind common ways of creating rule packs.
A rule is a function that converts a data unit of interest (data, group, column, row, cell) to a logical value indicating whether that unit satisfies a certain condition.
A rule pack is a function that combines several rules for a common data unit into one functional block. The recommended way of creating rules is to create packs right away using dplyr and magrittr's pipe operator.
Some of ruler's functionality is powered by the keyholder package. It is highly recommended to use its supported functions during rule pack construction. All one- and two-table dplyr verbs applied to local data frames are supported and considered the most appropriate way to create rule packs.
As described in the vignette about the design process, a rule pack must have a type because outputs for different data units have different structures. For this reason ruler has a family of *_packs() constructors (where * stands for the name of the data unit): data_packs(), group_packs(), col_packs(), row_packs() and cell_packs().
To check whether the dimensions of mtcars obey some rules, one can write the following dplyr pipeline:
library(dplyr)  # provides %>%, summarise() and the other verbs used below
mtcars %>% summarise(
  nrow_low = nrow(.) > 10,
  nrow_high = nrow(.) < 30,
  ncol = ncol(.) == 12
)
#> nrow_low nrow_high ncol
#> 1 TRUE FALSE FALSE
The output is a data frame with a single row and one logical column per rule, named after that rule.
There is an easy way to transform this pipeline into a function that can be used for any data: mtcars should be replaced with the "." character. To indicate that this function is a rule pack for the data unit 'data', it should be wrapped with data_packs().
The following code creates a list my_data_packs with one data rule pack named my_data_pack_1. That rule pack defines rules named nrow_low, nrow_high and ncol.
library(ruler)  # provides data_packs() and the other *_packs() constructors
my_data_packs <- data_packs(
  my_data_pack_1 = . %>% summarise(
    nrow_low = nrow(.) > 10,
    nrow_high = nrow(.) < 30,
    ncol = ncol(.) == 12
  )
)
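A pack by itself only describes rules; it is applied to data with ruler's expose(). The following is a minimal sketch of that step, assuming the ruler and dplyr packages are installed. It repeats the pack definition so that it runs on its own. By default expose() keeps only rule breakers in the report, so for mtcars (32 rows, 11 columns) only the two violated rules should appear.

```r
library(dplyr)
library(ruler)

# Same data pack as above, repeated to keep the sketch self-contained
my_data_packs <- data_packs(
  my_data_pack_1 = . %>% summarise(
    nrow_low = nrow(.) > 10,
    nrow_high = nrow(.) < 30,
    ncol = ncol(.) == 12
  )
)

# Apply the pack to mtcars and extract the tidy validation report
data_report <- mtcars %>%
  expose(my_data_packs) %>%
  get_report()

data_report
```

The same pack can be exposed on any other data frame, because the pipeline starts from "." rather than from a fixed data set.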
To check whether certain groups of rows of mtcars obey some rules, one can write the following dplyr pipeline:
mtcars %>% group_by(vs, am) %>%
  summarise(any_cyl_6 = any(cyl == 6))
#> `summarise()` regrouping output by 'vs' (override with `.groups` argument)
#> # A tibble: 4 x 3
#> # Groups: vs [2]
#> vs am any_cyl_6
#> <dbl> <dbl> <lgl>
#> 1 0 0 FALSE
#> 2 0 1 TRUE
#> 3 1 0 TRUE
#> 4 1 1 FALSE
The output has one row per group: the grouping columns (vs and am in this case) identify the group, and every other column holds the logical result of one rule.
The following code creates a list with one nameless group rule pack (the name will be imputed during exposure). This pack contains one rule, any_cyl_6, which checks every group defined by the vs and am columns.
my_group_packs <- group_packs(
  . %>% group_by(vs, am) %>%
    summarise(any_cyl_6 = any(cyl == 6)),
  .group_vars = c("vs", "am")
)
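Exposing data with this group pack might look like the sketch below (assuming ruler and dplyr are installed; the definition is repeated so the block runs on its own). Each group is identified in the report's var column by its grouping values joined together. Only the groups without any 6-cylinder car are expected to be reported, since obeyers are removed by default.

```r
library(dplyr)
library(ruler)

# Same group pack as above, repeated to keep the sketch self-contained
my_group_packs <- group_packs(
  . %>% group_by(vs, am) %>%
    summarise(any_cyl_6 = any(cyl == 6)),
  .group_vars = c("vs", "am")
)

# Apply the pack to mtcars and extract the tidy validation report
group_report <- mtcars %>%
  expose(my_group_packs) %>%
  get_report()

# Group identifiers of the rule breakers
group_report$var
```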
Notes:
- During exposure the pack output is ungrouped.
- Grouping columns should be supplied explicitly via the .group_vars argument to distinguish them from non-grouping ones.
- The var column in the validation report is created by uniting the grouping columns with the default separator ".". In this case the values will be 0.0, 0.1, 1.0, 1.1. To change the separator, supply it with the .group_sep argument.
To check whether certain columns of mtcars obey some rules, one can write the following dplyr pipeline:
# TRUE when every value of x equals its integer truncation
is_integerish <- function(x) {
  all(x == as.integer(x))
}
mtcars %>%
  summarise_if(is_integerish, funs(mean_low = mean(.) > 0.5))
#> Warning: `funs()` is deprecated as of dplyr 0.8.0.
#> Please use a list of either functions or lambdas:
#>
#> # Simple named list:
#> list(mean = mean, median = median)
#>
#> # Auto named with `tibble::lst()`:
#> tibble::lst(mean, median)
#>
#> # Using lambdas
#> list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
#> cyl_mean_low hp_mean_low vs_mean_low am_mean_low gear_mean_low carb_mean_low
#> 1 TRUE TRUE FALSE FALSE TRUE TRUE
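The output is a single row in which every logical column corresponds to one pair of validated column and rule; its name joins the two with "_" (for example, cyl_mean_low).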
In general it is hard to automatically separate the output's column names into 'validated column name' and 'rule name' because the default separator "_" is a commonly used one. For this reason ruler has the function rules(), which wraps funs() with the following functionality:
- It imputes names of unnamed rules based on rules()'s arguments.
- It adds the prefix "._." (Morse code for 'R') to rule names. Note that one can change this prefix with the .prefix argument.
The following code creates a list with two elements:
- A column pack named my_col_pack_1, which checks obedience of 'integerish' columns to the rule mean_low.
- A nameless column pack, which checks obedience of column vs to some nameless rule (its name will be imputed as rule__1). Note the use of a named argument in vars(vs = "vs"). This is the current way in dplyr's scoped variants of summarise and mutate to force using both column and function names in the output's column name.
my_col_packs <- col_packs(
  my_col_pack_1 = . %>% summarise_if(
    is_integerish,
    rules(mean_low = mean(.) > 0.5)
  ),
  . %>% summarise_at(vars(vs = "vs"), rules(sum(.) > 300))
)
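To see why a distinctive prefix helps, consider splitting an output column name back into its two parts. The base R sketch below is only an illustration of the idea, not ruler's actual implementation; gear_ratio is a hypothetical column name chosen because it contains the "_" character itself.

```r
# Hypothetical output names: <column>_<rule> versus <column>_._.<rule>
plain_name <- "gear_ratio_mean_low"
prefixed_name <- "gear_ratio_._.mean_low"

# Splitting on "_" alone cannot tell where the column name ends
ambiguous <- strsplit(plain_name, "_", fixed = TRUE)[[1]]
ambiguous

# Splitting on "_._." recovers the column name and the rule name
parts <- strsplit(prefixed_name, "_._.", fixed = TRUE)[[1]]
parts
```

The first split yields four pieces with no indication of which belong to the column name, while the second yields exactly two.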
To check whether certain rows of mtcars are not outliers, one can write the following dplyr pipeline:
# Standardize x: distance from the mean in units of standard deviation
z_score <- function(x) {
  (x - mean(x)) / sd(x)
}
mtcars %>%
  mutate(rowMean = rowMeans(.)) %>%
  transmute(is_common_row_mean = abs(z_score(rowMean)) < 1) %>%
  slice(10:15)
#> is_common_row_mean
#> 1 TRUE
#> 2 TRUE
#> 3 TRUE
#> 4 TRUE
#> 5 TRUE
#> 6 FALSE
The output has one row per checked row and one logical column per rule, named after that rule.
A pipeline like the one above is quite common: for every row, compute some value based on all rows and then validate only some of them. However, in the validation report the id column should represent the row index in the original data frame, and this information is missing after applying slice().
This problem is solved by using the keyholder package. Its main purpose is to track information about rows while modifying a data frame. During exposure the pack is applied to a keyed version of the input data with the key equal to the row index. Note that to use this feature one should create rule packs using compositions of functions supported by keyholder.
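The effect of keying can be sketched outside of ruler. The block below assumes the keyholder and dplyr packages are installed. It uses keyholder's use_id() to key mtcars by row index, which is an assumption about how to mimic what happens during exposure rather than a copy of ruler's internals.

```r
library(dplyr)
library(keyholder)

# Key mtcars by row index, then subset rows with a keyholder-supported verb
kept <- mtcars %>%
  use_id() %>%
  slice(10:15)

# The key still holds the original row indices of the kept rows
kept_ids <- keys(kept)[[".id"]]
kept_ids
```

Without the key, the six remaining rows would simply be numbered from 1 to 6 and their original positions could not be recovered.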
The following code creates a list with one row pack, my_row_pack_1. It contains one rule, is_common_row_mean, that checks 6 rows (from 10 to 15) for not being an outlier in terms of row means (based on information from all rows).
my_row_packs <- row_packs(
  my_row_pack_1 = . %>% mutate(rowMean = rowMeans(.)) %>%
    transmute(is_common_row_mean = abs(z_score(rowMean)) < 1) %>%
    slice(10:15)
)
To check whether certain cells of mtcars are not outliers, one can write the following dplyr pipeline:
mtcars %>% transmute_if(
  is_integerish,
  funs(is_common = abs(z_score(.)) < 1)
) %>%
  slice(20:24)
#> cyl_is_common hp_is_common vs_is_common am_is_common gear_is_common
#> 1 FALSE FALSE FALSE FALSE TRUE
#> 2 FALSE TRUE FALSE TRUE TRUE
#> 3 FALSE TRUE TRUE TRUE TRUE
#> 4 FALSE TRUE TRUE TRUE TRUE
#> 5 FALSE FALSE TRUE TRUE TRUE
#> carb_is_common
#> 1 FALSE
#> 2 FALSE
#> 3 TRUE
#> 4 TRUE
#> 5 TRUE
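The output has one row per checked row and one logical column per pair of validated column and rule, so every value describes a single cell.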
Basically, a cell rule pack is a combination of column and row rule packs. It means:
- One should use rules() instead of funs() in scoped variants of transmute().
- One should use functions supported by keyholder.
The following code creates a list with one cell pack, my_cell_pack_1. It checks the cells of every integer-like column in rows 20-24 for not being an outlier within the column.
my_cell_packs <- cell_packs(
  my_cell_pack_1 = . %>% transmute_if(
    is_integerish,
    rules(is_common = abs(z_score(.)) < 1)
  ) %>%
    slice(20:24)
)