Programming with tidyr

Introduction

Most tidyr verbs use tidy evaluation to make interactive data exploration fast and fluid. Tidy evaluation is a special type of non-standard evaluation used throughout the tidyverse. Here’s some typical tidyr code:

library(tidyr)

iris %>%
  nest(data = !Species)
#> # A tibble: 3 × 2
#>   Species    data             
#>   <fct>      <list>           
#> 1 setosa     <tibble [50 × 4]>
#> 2 versicolor <tibble [50 × 4]>
#> 3 virginica  <tibble [50 × 4]>

Tidy evaluation is why we can use !Species to say “all the columns except Species”, without having to quote the column name ("Species") or refer to the enclosing data frame (iris$Species).

Two basic forms of tidy evaluation are used in tidyr: tidy selection and data masking.

We focus on tidy selection here, since it’s the most common. You can learn more about data masking in the equivalent vignette in dplyr: https://dplyr.tidyverse.org/dev/articles/programming.html. For other considerations when writing tidyr code in packages, please see vignette("in-packages").

We’ve pointed out that tidyr’s tidy evaluation interface is optimized for interactive exploration. The flip side is that this adds some challenges to indirect use, i.e. when you’re working inside a for loop or a function. This vignette shows you how to overcome those challenges. We’ll first go over the basics of tidy selection and data masking, talk about how to use them indirectly, and then show you a number of recipes to solve common problems.

Before we go on, we reveal the version of tidyr we’re using and make a small dataset to use in examples.

packageVersion("tidyr")
#> [1] '1.2.1'

mini_iris <- as_tibble(iris)[c(1, 2, 51, 52, 101, 102), ]
mini_iris
#> # A tibble: 6 × 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
#> 1          5.1         3.5          1.4         0.2 setosa    
#> 2          4.9         3            1.4         0.2 setosa    
#> 3          7           3.2          4.7         1.4 versicolor
#> 4          6.4         3.2          4.5         1.5 versicolor
#> 5          6.3         3.3          6           2.5 virginica 
#> 6          5.8         2.7          5.1         1.9 virginica

Tidy selection

Underneath all functions that use tidy selection is the tidyselect package. It provides a miniature domain-specific language that makes it easy to select columns by name, position, or type. For example:
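As a minimal sketch of the three kinds of selection (by name, by position, and by type), here is nest() applied to the mini_iris data. The dataset is rebuilt inside the block so that it runs on its own; the object names by_name, by_position and by_type are only illustrative.

```r
library(tidyr)

mini_iris <- as_tibble(iris)[c(1, 2, 51, 52, 101, 102), ]

# Select by name: only the two petal columns go into the list-column
by_name <- nest(mini_iris, data = c(Petal.Length, Petal.Width))

# Select by position: the first four columns are nested, Species is kept
by_position <- nest(mini_iris, data = 1:4)

# Select by type: every numeric column is nested
by_type <- nest(mini_iris, data = where(is.numeric))

names(by_name)
names(by_position)
names(by_type)
```

In this dataset the last two calls pick the same four columns, so both results have one row per Species.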

You can see more details in ?tidyr_tidy_select.

Indirection

Tidy selection makes a common task easier at the cost of making a less common task harder. When you want to use tidy selection indirectly, with the column specification stored in an intermediate variable, you’ll need to learn some new tools. There are three main cases where this comes up:
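The sketch below illustrates three situations of this kind, on the assumption that they are representative: a selection passed through a function argument, column names held in a character vector, and a function that needs to resolve the selection itself. The helpers nest_egg() and selected_names(), and the variable vars, are hypothetical names introduced for illustration.

```r
library(tidyr)

mini_iris <- as_tibble(iris)[c(1, 2, 51, 52, 101, 102), ]

# 1. Selection supplied through a function argument: embrace it with {{ }}
#    so that nest() sees the caller's expression rather than the name `cols`.
nest_egg <- function(df, cols) {
  nest(df, egg = {{ cols }})
}
nested_fn <- nest_egg(mini_iris, !Species)

# 2. Column names stored as strings: all_of() errors if a name is missing,
#    any_of() silently skips names that are not present.
vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
nested_all <- nest(mini_iris, egg = all_of(vars))
nested_any <- nest(mini_iris, egg = any_of(c(vars, "not_a_column")))

# 3. Using tidyselect directly: eval_select() turns a captured selection
#    into a named integer vector of column positions.
selected_names <- function(df, cols) {
  names(tidyselect::eval_select(rlang::enquo(cols), df))
}
selected_names(mini_iris, !Species)
```

All three nest() calls select the same four measurement columns, so each result has a Species column and an egg list-column.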

Note that many tidyr functions use ... so you can easily select many variables, e.g. fill(df, x, y, z). I now believe that the disadvantages of this approach outweigh the benefits, and that this interface would have been better as fill(df, c(x, y, z)). For new functions that select columns, please just use a single argument and not ....
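To make the recommendation concrete, here is a small sketch of a function that takes its column selection as a single argument. The wrapper fill_cols() and the toy data frame are hypothetical; the sketch assumes only that fill() accepts a c() expression wherever it accepts separate column names.

```r
library(tidyr)

df <- tibble(
  x = c(1, NA, 3, NA),
  y = c(NA, 2, NA, 4),
  z = c("a", NA, "b", NA)
)

# The existing interface spreads the selection over `...`
via_dots <- fill(df, x, y)

# The same selection written as one tidy-select expression
via_c <- fill(df, c(x, y))
identical(via_dots, via_c)

# Hypothetical wrapper with a single selection argument, `cols`.
# Other arguments (here `direction`) cannot be confused with column names.
fill_cols <- function(df, cols, direction = "down") {
  fill(df, {{ cols }}, .direction = direction)
}
filled <- fill_cols(df, c(x, y))
```

With a single argument, callers write fill_cols(df, c(x, y)) for several columns and fill_cols(df, x) for one, and the function signature leaves room for further named arguments.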