Use case for nplyr

The package README uses the gapminder dataset to demonstrate nplyr’s functionality. In this case (and other similar cases) the output from nplyr’s nested operations could be obtained by unnesting and performing grouped dplyr operations.

library(nplyr)

gm_nest <- 
  gapminder::gapminder_unfiltered %>%
  tidyr::nest(country_data = -continent)

gm_nest
#> # A tibble: 6 × 2
#>   continent country_data        
#>   <fct>     <list>              
#> 1 Asia      <tibble [578 × 5]>  
#> 2 Europe    <tibble [1,302 × 5]>
#> 3 Africa    <tibble [637 × 5]>  
#> 4 Americas  <tibble [470 × 5]>  
#> 5 FSU       <tibble [139 × 5]>  
#> 6 Oceania   <tibble [187 × 5]>

# we can use nplyr to perform operations on the nested data
gm_nest %>%
  nest_filter(country_data, year == max(year)) %>%
  nest_mutate(country_data, pop_millions = pop/1000000) %>%
  slice_head(n = 1) %>%
  tidyr::unnest(country_data)
#> # A tibble: 43 × 7
#>    continent country           year lifeExp        pop gdpPercap pop_millions
#>    <fct>     <fct>            <int>   <dbl>      <int>     <dbl>        <dbl>
#>  1 Asia      Afghanistan       2007    43.8   31889923      975.       31.9  
#>  2 Asia      Azerbaijan        2007    67.5    8017309     7709.        8.02 
#>  3 Asia      Bahrain           2007    75.6     708573    29796.        0.709
#>  4 Asia      Bangladesh        2007    64.1  150448339     1391.      150.   
#>  5 Asia      Bhutan            2007    65.6    2327849     4745.        2.33 
#>  6 Asia      Brunei            2007    77.1     386511    48015.        0.387
#>  7 Asia      Cambodia          2007    59.7   14131858     1714.       14.1  
#>  8 Asia      China             2007    73.0 1318683096     4959.     1319.   
#>  9 Asia      Hong Kong, China  2007    82.2    6980412    39725.        6.98 
#> 10 Asia      India             2007    64.7 1110396331     2452.     1110.   
#> # … with 33 more rows

# in this case, we could have obtained the same result with tidyr and dplyr
gm_nest %>%
  tidyr::unnest(country_data) %>%
  group_by(continent) %>%
  filter(year == max(year)) %>%
  mutate(pop_millions = pop/1000000) %>%
  ungroup() %>%
  filter(continent == "Asia")
#> # A tibble: 43 × 7
#>    continent country           year lifeExp        pop gdpPercap pop_millions
#>    <fct>     <fct>            <int>   <dbl>      <int>     <dbl>        <dbl>
#>  1 Asia      Afghanistan       2007    43.8   31889923      975.       31.9  
#>  2 Asia      Azerbaijan        2007    67.5    8017309     7709.        8.02 
#>  3 Asia      Bahrain           2007    75.6     708573    29796.        0.709
#>  4 Asia      Bangladesh        2007    64.1  150448339     1391.      150.   
#>  5 Asia      Bhutan            2007    65.6    2327849     4745.        2.33 
#>  6 Asia      Brunei            2007    77.1     386511    48015.        0.387
#>  7 Asia      Cambodia          2007    59.7   14131858     1714.       14.1  
#>  8 Asia      China             2007    73.0 1318683096     4959.     1319.   
#>  9 Asia      Hong Kong, China  2007    82.2    6980412    39725.        6.98 
#> 10 Asia      India             2007    64.7 1110396331     2452.     1110.   
#> # … with 33 more rows

Why, then, might we need to use nplyr? Well, in other scenarios, it may be far more convenient to work with nested data frames or it may not even be possible to unnest!

A motivating example

Consider a set of surveys that an organization might use to gather market data. It is common for organizations to have separate surveys for separate purposes but gather the same baseline set of data across all surveys (for example, a respondent’s age and gender may be recorded across all surveys, but each survey will have a different set of questions). Let’s use two fake surveys with the below questions for this example:

Survey 1: Job
  1. How old are you? (multiple choice)
  2. What city do you live in? (multiple choice)
  3. What field do you work in? (multiple choice)
  4. Overall, how satisfied are you with your job? (multiple choice)
  5. What is your annual salary? (numeric entry)
Survey 2: Personal Life
  1. How old are you? (multiple choice)
  2. What city do you live in (multiple choice)
  3. What field do you work in? (multiple choice)
  4. Overall, how satisfied are you with your personal life (multiple choice)
  5. Please provide additional detail (text entry)

In this scenario, both surveys are collecting demographic information — age, location, and industry — but differ in the questions. A convenient way to get the response files into the environment would be to use purrr::map() to read in each file to a nested data frame.

path <- "https://raw.githubusercontent.com/markjrieke/nplyr/main/data-raw/"

surveys <- 
  tibble::tibble(survey_file = c("job_survey", "personal_survey")) %>%
  mutate(survey_data = purrr::map(survey_file, ~readr::read_csv(paste0(path, .x, ".csv"))))

surveys
#> # A tibble: 2 × 2
#>   survey_file     survey_data            
#>   <chr>           <list>                 
#> 1 job_survey      <spec_tbl_df [500 × 6]>
#> 2 personal_survey <spec_tbl_df [750 × 6]>

tidyr::unnest() can usually handle idiosyncrasies in layout when unnesting but in this case unnesting throws an error!

surveys %>%
  tidyr::unnest(survey_data)
#> Error:
#> ! Can't combine `Q5` <double> and `Q5` <character>.

This is because the surveys share column names but not necessarily column types! In this case, both data frames contain a column named “Q5”, but in job_survey it’s a double and in personal_survey it’s a character.

surveys %>%
  slice(1) %>%
  tidyr::unnest(survey_data) %>%
  glimpse()
#> Rows: 500
#> Columns: 7
#> $ survey_file <chr> "job_survey", "job_survey", "job_survey", "job_survey", "j…
#> $ survey_name <chr> "job", "job", "job", "job", "job", "job", "job", "job", "j…
#> $ Q1          <dbl> 100, 81, 51, 81, 80, 32, 65, 57, 43, 94, 25, 83, 61, 66, 8…
#> $ Q2          <chr> "Austin", "San Antonio", "Austin", "Austin", "Dallas", "Fo…
#> $ Q3          <chr> "Consulting", "Consulting", "Consulting", "Technology", "C…
#> $ Q4          <chr> "Somewhat dissatisfied", "Neither satisfied nor dissatisfi…
#> $ Q5          <dbl> 163, 48, 190, 25, 143, 233, 43, 243, 158, 235, 245, 195, 2…

surveys %>%
  slice(2) %>%
  tidyr::unnest(survey_data) %>%
  glimpse()
#> Rows: 750
#> Columns: 7
#> $ survey_file <chr> "personal_survey", "personal_survey", "personal_survey", "…
#> $ survey_name <chr> "personal", "personal", "personal", "personal", "personal"…
#> $ Q1          <dbl> 91, 32, 40, 23, 88, 69, 96, 40, 57, 40, 39, 70, 29, 38, 57…
#> $ Q2          <chr> "Austin", "San Antonio", "San Antonio", "Austin", "Dallas"…
#> $ Q3          <chr> "Energy", "Healthcare", "Consulting", "Consulting", "Techn…
#> $ Q4          <chr> "Neither satisfied nor dissatisfied", "Extremely satisfied…
#> $ Q5          <chr> "Blandit eros! A, ligula facilisis imperdiet! Interdum pla…

We could potentially get around this issue with unnesting by reading in all columns as characters via readr::read_csv(x, col_types = cols(.default = "c")), but this presents its own challenges. Q5 would still be better represented as a double in job_survey and from the survey question text Q4 has similar, but distinctly different, meanings across the survey files.

This is where nplyr comes into play! Rather than malign the data types or create separate objects for each survey file, we can use nplyr to perform operations directly on the nested data frames.

surveys <- 
  surveys %>%
  nest_mutate(survey_data,
              age_group = if_else(Q1 < 65, "Adult", "Retirement Age")) %>%
  nest_group_by(survey_data, Q3) %>%
  nest_add_count(survey_data, 
                 name = "n_respondents_in_industry") %>%
  nest_mutate(survey_data, 
              median_industry_age = median(Q1)) %>%
  nest_ungroup(survey_data)

surveys %>%
  slice(1) %>%
  tidyr::unnest(survey_data)
#> # A tibble: 500 × 10
#>    survey_file survey_name    Q1 Q2          Q3            Q4       Q5 age_group
#>    <chr>       <chr>       <dbl> <chr>       <chr>         <chr> <dbl> <chr>    
#>  1 job_survey  job           100 Austin      Consulting    Some…   163 Retireme…
#>  2 job_survey  job            81 San Antonio Consulting    Neit…    48 Retireme…
#>  3 job_survey  job            51 Austin      Consulting    Extr…   190 Adult    
#>  4 job_survey  job            81 Austin      Technology    Extr…    25 Retireme…
#>  5 job_survey  job            80 Dallas      Consulting    Extr…   143 Retireme…
#>  6 job_survey  job            32 Fort Worth  Energy        Some…   233 Adult    
#>  7 job_survey  job            65 Dallas      Consulting    Some…    43 Retireme…
#>  8 job_survey  job            57 Houston     Healthcare    Some…   243 Adult    
#>  9 job_survey  job            43 Dallas      Government S… Neit…   158 Adult    
#> 10 job_survey  job            94 Fort Worth  Healthcare    Extr…   235 Retireme…
#> # … with 490 more rows, and 2 more variables: n_respondents_in_industry <int>,
#> #   median_industry_age <dbl>

surveys %>%
  slice(2) %>%
  tidyr::unnest(survey_data)
#> # A tibble: 750 × 10
#>    survey_file     survey_name    Q1 Q2          Q3        Q4    Q5    age_group
#>    <chr>           <chr>       <dbl> <chr>       <chr>     <chr> <chr> <chr>    
#>  1 personal_survey personal       91 Austin      Energy    Neit… Blan… Retireme…
#>  2 personal_survey personal       32 San Antonio Healthca… Extr… Elem… Adult    
#>  3 personal_survey personal       40 San Antonio Consulti… Some… Eget… Adult    
#>  4 personal_survey personal       23 Austin      Consulti… Extr… Scel… Adult    
#>  5 personal_survey personal       88 Dallas      Technolo… Neit… Aene… Retireme…
#>  6 personal_survey personal       69 Fort Worth  Technolo… Neit… Inte… Retireme…
#>  7 personal_survey personal       96 Houston     Healthca… Extr… Blan… Retireme…
#>  8 personal_survey personal       40 Houston     Consulti… Extr… Scel… Adult    
#>  9 personal_survey personal       57 Fort Worth  Energy    Extr… Pede… Adult    
#> 10 personal_survey personal       40 Fort Worth  Healthca… Extr… Phar… Adult    
#> # … with 740 more rows, and 2 more variables: n_respondents_in_industry <int>,
#> #   median_industry_age <dbl>