The package README uses the gapminder dataset to demonstrate nplyr’s functionality. In this case (and other similar cases) the output from nplyr’s nested operations could be obtained by unnesting and performing grouped dplyr operations.
library(nplyr)
<-
gm_nest ::gapminder_unfiltered %>%
gapminder::nest(country_data = -continent)
tidyr
gm_nest#> # A tibble: 6 × 2
#> continent country_data
#> <fct> <list>
#> 1 Asia <tibble [578 × 5]>
#> 2 Europe <tibble [1,302 × 5]>
#> 3 Africa <tibble [637 × 5]>
#> 4 Americas <tibble [470 × 5]>
#> 5 FSU <tibble [139 × 5]>
#> 6 Oceania <tibble [187 × 5]>
# we can use nplyr to perform operations on the nested data
%>%
gm_nest nest_filter(country_data, year == max(year)) %>%
nest_mutate(country_data, pop_millions = pop/1000000) %>%
slice_head(n = 1) %>%
::unnest(country_data)
tidyr#> # A tibble: 43 × 7
#> continent country year lifeExp pop gdpPercap pop_millions
#> <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
#> 1 Asia Afghanistan 2007 43.8 31889923 975. 31.9
#> 2 Asia Azerbaijan 2007 67.5 8017309 7709. 8.02
#> 3 Asia Bahrain 2007 75.6 708573 29796. 0.709
#> 4 Asia Bangladesh 2007 64.1 150448339 1391. 150.
#> 5 Asia Bhutan 2007 65.6 2327849 4745. 2.33
#> 6 Asia Brunei 2007 77.1 386511 48015. 0.387
#> 7 Asia Cambodia 2007 59.7 14131858 1714. 14.1
#> 8 Asia China 2007 73.0 1318683096 4959. 1319.
#> 9 Asia Hong Kong, China 2007 82.2 6980412 39725. 6.98
#> 10 Asia India 2007 64.7 1110396331 2452. 1110.
#> # … with 33 more rows
# in this case, we could have obtained the same result with tidyr and dplyr
%>%
gm_nest ::unnest(country_data) %>%
tidyrgroup_by(continent) %>%
filter(year == max(year)) %>%
mutate(pop_millions = pop/1000000) %>%
ungroup() %>%
filter(continent == "Asia")
#> # A tibble: 43 × 7
#> continent country year lifeExp pop gdpPercap pop_millions
#> <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
#> 1 Asia Afghanistan 2007 43.8 31889923 975. 31.9
#> 2 Asia Azerbaijan 2007 67.5 8017309 7709. 8.02
#> 3 Asia Bahrain 2007 75.6 708573 29796. 0.709
#> 4 Asia Bangladesh 2007 64.1 150448339 1391. 150.
#> 5 Asia Bhutan 2007 65.6 2327849 4745. 2.33
#> 6 Asia Brunei 2007 77.1 386511 48015. 0.387
#> 7 Asia Cambodia 2007 59.7 14131858 1714. 14.1
#> 8 Asia China 2007 73.0 1318683096 4959. 1319.
#> 9 Asia Hong Kong, China 2007 82.2 6980412 39725. 6.98
#> 10 Asia India 2007 64.7 1110396331 2452. 1110.
#> # … with 33 more rows
Why, then, might we need to use nplyr? Well, in other scenarios, it may be far more convenient to work with nested data frames or it may not even be possible to unnest!
Consider a set of surveys that an organization might use to gather market data. It is common for organizations to have separate surveys for separate purposes but gather the same baseline set of data across all surveys (for example, a respondent’s age and gender may be recorded across all surveys, but each survey will have a different set of questions). Let’s use two fake surveys with the below questions for this example:
In this scenario, both surveys are collecting demographic information
— age, location, and industry — but differ in the questions. A
convenient way to get the response files into the environment would be
to use purrr::map()
to read in each file to a nested data frame.
<- "https://raw.githubusercontent.com/markjrieke/nplyr/main/data-raw/"
path
<-
surveys ::tibble(survey_file = c("job_survey", "personal_survey")) %>%
tibblemutate(survey_data = purrr::map(survey_file, ~readr::read_csv(paste0(path, .x, ".csv"))))
surveys#> # A tibble: 2 × 2
#> survey_file survey_data
#> <chr> <list>
#> 1 job_survey <spec_tbl_df [500 × 6]>
#> 2 personal_survey <spec_tbl_df [750 × 6]>
tidyr::unnest()
can usually handle idiosyncrasies in layout when unnesting but in this
case unnesting throws an error!
%>%
surveys ::unnest(survey_data)
tidyr#> Error:
#> ! Can't combine `Q5` <double> and `Q5` <character>.
This is because the surveys share column names but not necessarily
column types! In this case, both data frames contain a column named
“Q5”, but in job_survey
it’s a double and in
personal_survey
it’s a character.
%>%
surveys slice(1) %>%
::unnest(survey_data) %>%
tidyrglimpse()
#> Rows: 500
#> Columns: 7
#> $ survey_file <chr> "job_survey", "job_survey", "job_survey", "job_survey", "j…
#> $ survey_name <chr> "job", "job", "job", "job", "job", "job", "job", "job", "j…
#> $ Q1 <dbl> 100, 81, 51, 81, 80, 32, 65, 57, 43, 94, 25, 83, 61, 66, 8…
#> $ Q2 <chr> "Austin", "San Antonio", "Austin", "Austin", "Dallas", "Fo…
#> $ Q3 <chr> "Consulting", "Consulting", "Consulting", "Technology", "C…
#> $ Q4 <chr> "Somewhat dissatisfied", "Neither satisfied nor dissatisfi…
#> $ Q5 <dbl> 163, 48, 190, 25, 143, 233, 43, 243, 158, 235, 245, 195, 2…
%>%
surveys slice(2) %>%
::unnest(survey_data) %>%
tidyrglimpse()
#> Rows: 750
#> Columns: 7
#> $ survey_file <chr> "personal_survey", "personal_survey", "personal_survey", "…
#> $ survey_name <chr> "personal", "personal", "personal", "personal", "personal"…
#> $ Q1 <dbl> 91, 32, 40, 23, 88, 69, 96, 40, 57, 40, 39, 70, 29, 38, 57…
#> $ Q2 <chr> "Austin", "San Antonio", "San Antonio", "Austin", "Dallas"…
#> $ Q3 <chr> "Energy", "Healthcare", "Consulting", "Consulting", "Techn…
#> $ Q4 <chr> "Neither satisfied nor dissatisfied", "Extremely satisfied…
#> $ Q5 <chr> "Blandit eros! A, ligula facilisis imperdiet! Interdum pla…
We could potentially get around this issue with unnesting by reading
in all columns as characters via
readr::read_csv(x, col_types = cols(.default = "c"))
, but
this presents its own challenges. Q5
would still be better
represented as a double in job_survey
and from the survey
question text Q4
has similar, but distinctly different,
meanings across the survey files.
This is where nplyr comes into play! Rather than malign the data types or create separate objects for each survey file, we can use nplyr to perform operations directly on the nested data frames.
<-
surveys %>%
surveys nest_mutate(survey_data,
age_group = if_else(Q1 < 65, "Adult", "Retirement Age")) %>%
nest_group_by(survey_data, Q3) %>%
nest_add_count(survey_data,
name = "n_respondents_in_industry") %>%
nest_mutate(survey_data,
median_industry_age = median(Q1)) %>%
nest_ungroup(survey_data)
%>%
surveys slice(1) %>%
::unnest(survey_data)
tidyr#> # A tibble: 500 × 10
#> survey_file survey_name Q1 Q2 Q3 Q4 Q5 age_group
#> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 job_survey job 100 Austin Consulting Some… 163 Retireme…
#> 2 job_survey job 81 San Antonio Consulting Neit… 48 Retireme…
#> 3 job_survey job 51 Austin Consulting Extr… 190 Adult
#> 4 job_survey job 81 Austin Technology Extr… 25 Retireme…
#> 5 job_survey job 80 Dallas Consulting Extr… 143 Retireme…
#> 6 job_survey job 32 Fort Worth Energy Some… 233 Adult
#> 7 job_survey job 65 Dallas Consulting Some… 43 Retireme…
#> 8 job_survey job 57 Houston Healthcare Some… 243 Adult
#> 9 job_survey job 43 Dallas Government S… Neit… 158 Adult
#> 10 job_survey job 94 Fort Worth Healthcare Extr… 235 Retireme…
#> # … with 490 more rows, and 2 more variables: n_respondents_in_industry <int>,
#> # median_industry_age <dbl>
%>%
surveys slice(2) %>%
::unnest(survey_data)
tidyr#> # A tibble: 750 × 10
#> survey_file survey_name Q1 Q2 Q3 Q4 Q5 age_group
#> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
#> 1 personal_survey personal 91 Austin Energy Neit… Blan… Retireme…
#> 2 personal_survey personal 32 San Antonio Healthca… Extr… Elem… Adult
#> 3 personal_survey personal 40 San Antonio Consulti… Some… Eget… Adult
#> 4 personal_survey personal 23 Austin Consulti… Extr… Scel… Adult
#> 5 personal_survey personal 88 Dallas Technolo… Neit… Aene… Retireme…
#> 6 personal_survey personal 69 Fort Worth Technolo… Neit… Inte… Retireme…
#> 7 personal_survey personal 96 Houston Healthca… Extr… Blan… Retireme…
#> 8 personal_survey personal 40 Houston Consulti… Extr… Scel… Adult
#> 9 personal_survey personal 57 Fort Worth Energy Extr… Pede… Adult
#> 10 personal_survey personal 40 Fort Worth Healthca… Extr… Phar… Adult
#> # … with 740 more rows, and 2 more variables: n_respondents_in_industry <int>,
#> # median_industry_age <dbl>