assertr

assertr logo

Build Status CRAN RStudio mirror downloads

What is it?

The assertr package supplies a suite of functions designed to verify assumptions about data early in an analysis pipeline so that data errors are spotted early and can be addressed quickly.

This package does not need to be used with the magrittr/dplyr piping mechanism but the examples in this README use them for clarity.

Installation

You can install the latest version on CRAN like this

    install.packages("assertr")

or you can install the bleeding-edge development version like this:

    install.packages("devtools")
    devtools::install_github("ropensci/assertr")

What does it look like?

This package offers five assertion functions, assert, verify, insist, assert_rows, and insist_rows, that are designed to be used shortly after data-loading in an analysis pipeline…

Let’s say, for example, that the R’s built-in car dataset, mtcars, was not built-in but rather procured from an external source that was known for making errors in data entry or coding. Pretend we wanted to find the average miles per gallon for each number of engine cylinders. We might want to first, confirm - that it has the columns “mpg”, “vs”, and “am” - that the dataset contains more than 10 observations - that the column for ‘miles per gallon’ (mpg) is a positive number - that the column for ‘miles per gallon’ (mpg) does not contain a datum that is outside 4 standard deviations from its mean, and - that the am and vs columns (automatic/manual and v/straight engine, respectively) contain 0s and 1s only - each row contains at most 2 NAs - each row is unique jointly between the “mpg”, “am”, and “wt” columns - each row’s mahalanobis distance is within 10 median absolute deviations of all the distances (for outlier detection)

This could be written (in order) using assertr like this:

    library(dplyr)
    library(assertr)

    mtcars %>%
      verify(has_all_names("mpg", "vs", "am", "wt")) %>%
      verify(nrow(.) > 10) %>%
      verify(mpg > 0) %>%
      insist(within_n_sds(4), mpg) %>%
      assert(in_set(0,1), am, vs) %>%
      assert_rows(num_row_NAs, within_bounds(0,2), everything()) %>%
      assert_rows(col_concat, is_uniq, mpg, am, wt) %>%
      insist_rows(maha_dist, within_n_mads(10), everything()) %>%
      group_by(cyl) %>%
      summarise(avg.mpg=mean(mpg))

If any of these assertions were violated, an error would have been raised and the pipeline would have been terminated early.

Let’s see what the error message look like when you chain a bunch of failing assertions together.

    > mtcars %>%
    +   chain_start %>%
    +   assert(in_set(1, 2, 3, 4), carb) %>%
    +   assert_rows(rowMeans, within_bounds(0,5), gear:carb) %>%
    +   verify(nrow(.)==10) %>%
    +   verify(mpg < 32) %>%
    +   chain_end
    There are 7 errors across 4 verbs:
    -
             verb redux_fn           predicate     column index value
    1      assert     <NA>  in_set(1, 2, 3, 4)       carb    30   6.0
    2      assert     <NA>  in_set(1, 2, 3, 4)       carb    31   8.0
    3 assert_rows rowMeans within_bounds(0, 5) ~gear:carb    30   5.5
    4 assert_rows rowMeans within_bounds(0, 5) ~gear:carb    31   6.5
    5      verify     <NA>       nrow(.) == 10       <NA>     1    NA
    6      verify     <NA>            mpg < 32       <NA>    18    NA
    7      verify     <NA>            mpg < 32       <NA>    20    NA

    Error: assertr stopped execution

What does assertr give me?

assertr also offers four (so far) predicate functions designed to be used with the assert and assert_rows functions:

and predicate generators designed to be used with the insist and insist_rows functions:

and the following row reduction functions designed to be used with assert_rows and insist_rows:

More info

For more info, check out the assertr vignette

    > vignette("assertr")

Or read it here

ropensci_footer