Data validation rules control what constants can be entered into a cell, e.g. any whole number between 0 and 9, or one of several values from another part of the spreadsheet.
‘xlsx_validation()’ returns each of the data validation rules in an xlsx file, and the ranges of cells to which each rule applies.
Here is a rule that restricts input to integers between 0 and 9 inclusive, or no value (blank). If any other value is attempted, then an error message is displayed with the imaginative title “message title”, the informative body text “message body”, and a “stop” symbol.
library(tidyxl)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
<- system.file("extdata/examples.xlsx", package = "tidyxl")
examples glimpse(xlsx_validation(examples)[1, ])
#> Rows: 1
#> Columns: 14
#> $ sheet <chr> "Sheet1"
#> $ ref <chr> "A106"
#> $ type <chr> "whole"
#> $ operator <chr> "between"
#> $ formula1 <chr> "0"
#> $ formula2 <chr> "9"
#> $ allow_blank <lgl> TRUE
#> $ show_input_message <lgl> TRUE
#> $ prompt_title <chr> "message title"
#> $ prompt_body <chr> "message body"
#> $ show_error_message <lgl> TRUE
#> $ error_title <chr> "error title"
#> $ error_body <chr> "error body"
#> $ error_symbol <chr> "stop"
The gamut of possible rules is given in the examples for
xlsx_validation()
.
as.data.frame(xlsx_validation(examples))
#> sheet ref type operator formula1
#> 1 Sheet1 A106 whole between 0
#> 2 Sheet1 A108 list <NA> $B$108
#> 3 Sheet1 A110 date between 2017-01-01 00:00:00
#> 4 Sheet1 A111 time between 00:00:00
#> 5 Sheet1 A112 textLength between 0
#> 6 Sheet1 A114 whole notBetween 0
#> 7 Sheet1 A115,A121:A122 whole equal 0
#> 8 Sheet1 A116 whole notEqual 0
#> 9 Sheet1 A117 whole greaterThan 0
#> 10 Sheet1 A119 whole greaterThanOrEqual 0
#> 11 Sheet1 A120 whole lessThanOrEqual 0
#> 12 Sheet1 A118 whole lessThan 0
#> 13 Sheet1 A107 decimal notBetween 0
#> 14 Sheet1 A113 custom <NA> A113<=LEN(B113)
#> 15 Sheet1 A109 list <NA> $B$108
#> formula2 allow_blank show_input_message prompt_title
#> 1 9 TRUE TRUE message title
#> 2 <NA> TRUE TRUE <NA>
#> 3 2017-01-09 09:00:00 TRUE TRUE <NA>
#> 4 09:00:00 TRUE TRUE <NA>
#> 5 9 TRUE TRUE <NA>
#> 6 9 TRUE TRUE <NA>
#> 7 <NA> TRUE TRUE <NA>
#> 8 <NA> TRUE TRUE <NA>
#> 9 <NA> TRUE TRUE <NA>
#> 10 <NA> TRUE TRUE <NA>
#> 11 <NA> TRUE TRUE <NA>
#> 12 <NA> TRUE TRUE <NA>
#> 13 9 FALSE FALSE <NA>
#> 14 <NA> TRUE TRUE <NA>
#> 15 <NA> TRUE TRUE <NA>
#> prompt_body show_error_message error_title error_body error_symbol
#> 1 message body TRUE error title error body stop
#> 2 <NA> TRUE <NA> <NA> warning
#> 3 <NA> TRUE <NA> <NA> stop
#> 4 <NA> TRUE <NA> <NA> stop
#> 5 <NA> TRUE <NA> <NA> stop
#> 6 <NA> TRUE <NA> <NA> stop
#> 7 <NA> TRUE <NA> <NA> stop
#> 8 <NA> TRUE <NA> <NA> stop
#> 9 <NA> TRUE <NA> <NA> stop
#> 10 <NA> TRUE <NA> <NA> stop
#> 11 <NA> TRUE <NA> <NA> stop
#> 12 <NA> TRUE <NA> <NA> stop
#> 13 <NA> FALSE <NA> <NA> stop
#> 14 <NA> TRUE <NA> <NA> stop
#> 15 <NA> TRUE <NA> <NA> information
There are no built-in functions for joining ranges like
A1:D5,G8
to single cells likeB3
. For now, use the snippets in this section. In future I might develop a dplyr-like join function (this is hard currently because dplyr doesn’t yet join on arbitrary functions, or even the standard inequalities like>=
). Help and advice would be gratefully accepted!
To join rules to cells, a naive method is to use the
sheet
and ref
columns to match the
sheet
and address
columns to the output of
xlsx_cells()
.
<- xlsx_validation(examples)
rules <- filter(xlsx_cells(examples), row >= 106, col == 1)
cells
rules#> # A tibble: 15 × 14
#> sheet ref type opera…¹ formu…² formu…³ allow…⁴ show_…⁵ promp…⁶ promp…⁷
#> <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <lgl> <chr> <chr>
#> 1 Sheet1 A106 whole between 0 9 TRUE TRUE messag… messag…
#> 2 Sheet1 A108 list <NA> $B$108 <NA> TRUE TRUE <NA> <NA>
#> 3 Sheet1 A110 date between 2017-0… 2017-0… TRUE TRUE <NA> <NA>
#> 4 Sheet1 A111 time between 00:00:… 09:00:… TRUE TRUE <NA> <NA>
#> 5 Sheet1 A112 text… between 0 9 TRUE TRUE <NA> <NA>
#> 6 Sheet1 A114 whole notBet… 0 9 TRUE TRUE <NA> <NA>
#> 7 Sheet1 A115,A1… whole equal 0 <NA> TRUE TRUE <NA> <NA>
#> 8 Sheet1 A116 whole notEqu… 0 <NA> TRUE TRUE <NA> <NA>
#> 9 Sheet1 A117 whole greate… 0 <NA> TRUE TRUE <NA> <NA>
#> 10 Sheet1 A119 whole greate… 0 <NA> TRUE TRUE <NA> <NA>
#> 11 Sheet1 A120 whole lessTh… 0 <NA> TRUE TRUE <NA> <NA>
#> 12 Sheet1 A118 whole lessTh… 0 <NA> TRUE TRUE <NA> <NA>
#> 13 Sheet1 A107 deci… notBet… 0 9 FALSE FALSE <NA> <NA>
#> 14 Sheet1 A113 cust… <NA> A113<=… <NA> TRUE TRUE <NA> <NA>
#> 15 Sheet1 A109 list <NA> $B$108 <NA> TRUE TRUE <NA> <NA>
#> # … with 4 more variables: show_error_message <lgl>, error_title <chr>,
#> # error_body <chr>, error_symbol <chr>, and abbreviated variable names
#> # ¹operator, ²formula1, ³formula2, ⁴allow_blank, ⁵show_input_message,
#> # ⁶prompt_title, ⁷prompt_body
cells#> # A tibble: 93 × 24
#> sheet address row col is_blank content data_…¹ error logical numeric
#> <chr> <chr> <int> <int> <lgl> <chr> <chr> <chr> <lgl> <dbl>
#> 1 Sheet1 A106 106 1 FALSE 0 numeric <NA> NA 0
#> 2 Sheet1 A107 107 1 FALSE 0.1 numeric <NA> NA 0.1
#> 3 Sheet1 A108 108 1 FALSE 137 charac… <NA> NA NA
#> 4 Sheet1 A109 109 1 FALSE 137 charac… <NA> NA NA
#> 5 Sheet1 A110 110 1 FALSE 42736 date <NA> NA NA
#> 6 Sheet1 A111 111 1 FALSE 0.35416666… date <NA> NA NA
#> 7 Sheet1 A112 112 1 FALSE 149 charac… <NA> NA NA
#> 8 Sheet1 A113 113 1 FALSE 10 numeric <NA> NA 10
#> 9 Sheet1 A114 114 1 FALSE -1 numeric <NA> NA -1
#> 10 Sheet1 A115 115 1 FALSE 0 numeric <NA> NA 0
#> # … with 83 more rows, 14 more variables: date <dttm>, character <chr>,
#> # character_formatted <list>, formula <chr>, is_array <lgl>,
#> # formula_ref <chr>, formula_group <int>, comment <chr>, height <dbl>,
#> # width <dbl>, row_outline_level <dbl>, col_outline_level <dbl>,
#> # style_format <chr>, local_format_id <int>, and abbreviated variable name
#> # ¹data_type
inner_join(rules, cells, by = c("sheet" = "sheet", "ref" = "address"))
#> # A tibble: 9 × 36
#> sheet ref type opera…¹ formu…² formu…³ allow…⁴ show_…⁵ promp…⁶ promp…⁷
#> <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <lgl> <chr> <chr>
#> 1 Sheet1 A106 whole between 0 9 TRUE TRUE messag… messag…
#> 2 Sheet1 A108 list <NA> $B$108 <NA> TRUE TRUE <NA> <NA>
#> 3 Sheet1 A110 date between 2017-0… 2017-0… TRUE TRUE <NA> <NA>
#> 4 Sheet1 A111 time between 00:00:… 09:00:… TRUE TRUE <NA> <NA>
#> 5 Sheet1 A112 textLeng… between 0 9 TRUE TRUE <NA> <NA>
#> 6 Sheet1 A114 whole notBet… 0 9 TRUE TRUE <NA> <NA>
#> 7 Sheet1 A107 decimal notBet… 0 9 FALSE FALSE <NA> <NA>
#> 8 Sheet1 A113 custom <NA> A113<=… <NA> TRUE TRUE <NA> <NA>
#> 9 Sheet1 A109 list <NA> $B$108 <NA> TRUE TRUE <NA> <NA>
#> # … with 26 more variables: show_error_message <lgl>, error_title <chr>,
#> # error_body <chr>, error_symbol <chr>, row <int>, col <int>, is_blank <lgl>,
#> # content <chr>, data_type <chr>, error <chr>, logical <lgl>, numeric <dbl>,
#> # date <dttm>, character <chr>, character_formatted <list>, formula <chr>,
#> # is_array <lgl>, formula_ref <chr>, formula_group <int>, comment <chr>,
#> # height <dbl>, width <dbl>, row_outline_level <dbl>,
#> # col_outline_level <dbl>, style_format <chr>, local_format_id <int>, and …
Notice that only 9 cells were joined, even though 15 rules were
defined. Surely at least 15 cells ought to be joined? The reason why
they are not is that the cells for the other 6 rules don’t exist – rules
can be defined for cells that have no value, and cells with no value are
not returned by xlsx_cells()
, otherwise all 17179869184
cells in a worksheet must be returned.
A more subtle reason for certain cells not to have joined
successfully is that the ref
column of the rules sometimes
refers to more than one cell, and can even refer to several,
non-contiguous ranges of cells. Specifically, the seventh rule’s
ref
column has A115,A121:A122
.
Special treatment is needed here. Ideally, some kind of join function
would be defined that can compare indidual cells with ranges. But I
haven’t written one, so what follows is a workaround. First, the two
ranges of cells must be unnested
into A115
and
A121:122
. Then the range A121:122
must be
‘unranged’ into A121
and A122
.
<- function(x) {
unrange <- cellranger::as.cell_limits(x)
limits <- seq(limits$ul[1], limits$lr[1])
rows <- seq(limits$ul[2], limits$lr[2])
cols <- expand.grid(rows, cols)
rowcol <- cellranger::cell_addr(rowcol[[1]], rowcol[[2]])
cell_addrs ::to_string(cell_addrs, fo = "A1", strict = FALSE)
cellranger
}
<- function(x, ref) {
unnest_ref UseMethod("unnest_ref")
}
<- function(x, ref_col = ref) {
unnest_ref.default stopifnot(is.character(x), length(x) == 1L)
<- unlist(strsplit(x, ",", fixed = TRUE))
refs unlist(lapply(refs, unrange))
}
unrange("A121:A122")
#> [1] "A121" "A122"
unnest_ref("A115,A121:A122")
#> [1] "A115" "A121" "A122"
The unnest_ref()
function can also be defined for whole
data frames, unnesting them by a column of references.
<- function(x, ref_col) {
unnest_ref.data.frame <- rlang::enquo(ref_col)
ref ::quo_name(ref)]] <- lapply(x[[rlang::quo_name(ref)]], unnest_ref)
x[[rlang::unnest(x, rlang::UQ(ref))
tidyr
}
<- slice(rules, 7))
(nested_rule #> # A tibble: 1 × 14
#> sheet ref type opera…¹ formu…² formu…³ allow…⁴ show_…⁵ promp…⁶ promp…⁷
#> <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <lgl> <chr> <chr>
#> 1 Sheet1 A115,A12… whole equal 0 <NA> TRUE TRUE <NA> <NA>
#> # … with 4 more variables: show_error_message <lgl>, error_title <chr>,
#> # error_body <chr>, error_symbol <chr>, and abbreviated variable names
#> # ¹operator, ²formula1, ³formula2, ⁴allow_blank, ⁵show_input_message,
#> # ⁶prompt_title, ⁷prompt_body
unnest_ref(nested_rule, ref)
#> Warning: Prefixing `UQ()` with the rlang namespace is deprecated as of rlang 0.3.0.
#> Please use the non-prefixed form or `!!` instead.
#>
#> # Bad:
#> rlang::expr(mean(rlang::UQ(var) * 100))
#>
#> # Ok:
#> rlang::expr(mean(UQ(var) * 100))
#>
#> # Good:
#> rlang::expr(mean(!!var * 100))
#>
#> This warning is displayed once per session.
#> # A tibble: 3 × 14
#> sheet ref type operator formula1 formula2 allow_…¹ show_…² promp…³ promp…⁴
#> <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <lgl> <chr> <chr>
#> 1 Sheet1 A115 whole equal 0 <NA> TRUE TRUE <NA> <NA>
#> 2 Sheet1 A121 whole equal 0 <NA> TRUE TRUE <NA> <NA>
#> 3 Sheet1 A122 whole equal 0 <NA> TRUE TRUE <NA> <NA>
#> # … with 4 more variables: show_error_message <lgl>, error_title <chr>,
#> # error_body <chr>, error_symbol <chr>, and abbreviated variable names
#> # ¹allow_blank, ²show_input_message, ³prompt_title, ⁴prompt_body
Finally the new data frame of rules can be joined to a data frame of
cells in any of the usual ways, via the sheet
and
ref
columns.
Problems with this approach occur with rules that are defined over large ranges of cells: the ‘unnesting’ of those ranges results in very long vectors of individual cell addresses, or (worse) huge data frames of rules. Such cases are commonplace, because rules are often defined for entire columns of a spreadsheet, and a column has 1048576 rows.