Example 3: Reshape

When I design longer_dt and wider_dt, I could find the pivot_longer and pivot_wider in tidyr and melt and dcast in data.table. Still, designing this API is not easy, as my goal is to let users use it with least pain. Here we would try to reproduce the results in the vignette of tidyr(https://tidyr.tidyverse.org/articles/pivot.html). First load the packages:

library(tidyfst)
library(tidyr)

Longer

First inspect the data:

relig_income
#> # A tibble: 18 × 11
#>    religion      `<$10k` $10-2…¹ $20-3…² $30-4…³ $40-5…⁴ $50-7…⁵ $75-1…⁶ $100-…⁷
#>    <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#>  1 Agnostic           27      34      60      81      76     137     122     109
#>  2 Atheist            12      27      37      52      35      70      73      59
#>  3 Buddhist           27      21      30      34      33      58      62      39
#>  4 Catholic          418     617     732     670     638    1116     949     792
#>  5 Don’t know/r…      15      14      15      11      10      35      21      17
#>  6 Evangelical …     575     869    1064     982     881    1486     949     723
#>  7 Hindu               1       9       7       9      11      34      47      48
#>  8 Historically…     228     244     236     238     197     223     131      81
#>  9 Jehovah's Wi…      20      27      24      24      21      30      15      11
#> 10 Jewish             19      19      25      25      30      95      69      87
#> 11 Mainline Prot     289     495     619     655     651    1107     939     753
#> 12 Mormon             29      40      48      51      56     112      85      49
#> 13 Muslim              6       7       9      10       9      23      16       8
#> 14 Orthodox           13      17      23      32      32      47      38      42
#> 15 Other Christ…       9       7      11      13      13      14      18      14
#> 16 Other Faiths       20      33      40      46      49      63      46      40
#> 17 Other World …       5       2       3       4       2       7       3       4
#> 18 Unaffiliated      217     299     374     365     341     528     407     321
#> # … with 2 more variables: `>150k` <dbl>, `Don't know/refused` <dbl>, and
#> #   abbreviated variable names ¹​`$10-20k`, ²​`$20-30k`, ³​`$30-40k`, ⁴​`$40-50k`,
#> #   ⁵​`$50-75k`, ⁶​`$75-100k`, ⁷​`$100-150k`
#> # ℹ Use `colnames()` to see all variable names

In tidyr, to get the longer format you need:

relig_income %>% 
  pivot_longer(-religion, names_to = "income", values_to = "count")

In tidyfst, we have:

relig_income %>% 
  longer_dt("religion",name = "income",value = "count")
#>                   religion             income count
#>                     <char>             <fctr> <num>
#>   1:              Agnostic              <$10k    27
#>   2:               Atheist              <$10k    12
#>   3:              Buddhist              <$10k    27
#>   4:              Catholic              <$10k   418
#>   5:    Don’t know/refused              <$10k    15
#>  ---                                               
#> 176:              Orthodox Don't know/refused    73
#> 177:       Other Christian Don't know/refused    18
#> 178:          Other Faiths Don't know/refused    71
#> 179: Other World Religions Don't know/refused     8
#> 180:          Unaffiliated Don't know/refused   597

Another example from tidyr:

billboard

# tidyr way:
 billboard %>%
   pivot_longer(
     cols = starts_with("wk"),
     names_to = "week",
     values_to = "rank",
     values_drop_na = TRUE
   )

# tidyfst way:
billboard %>% 
  longer_dt(-"wk",
            name = "week",
            value = "rank",
            na.rm = TRUE
            )
# regex could select the groups to keep, and minus could select the reverse

A warning would could come out because the merging column has different data types and do the coercion automatically.

Wider

## data
fish_encounters
#> # A tibble: 114 × 3
#>    fish  station  seen
#>    <fct> <fct>   <int>
#>  1 4842  Release     1
#>  2 4842  I80_1       1
#>  3 4842  Lisbon      1
#>  4 4842  Rstr        1
#>  5 4842  Base_TD     1
#>  6 4842  BCE         1
#>  7 4842  BCW         1
#>  8 4842  BCE2        1
#>  9 4842  BCW2        1
#> 10 4842  MAE         1
#> # … with 104 more rows
#> # ℹ Use `print(n = ...)` to see more rows

## tidyr way:
fish_encounters %>% 
  pivot_wider(names_from = station, values_from = seen)
#> # A tibble: 19 × 12
#>    fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE   MAW
#>    <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int> <int>
#>  1 4842        1     1      1     1       1     1     1     1     1     1     1
#>  2 4843        1     1      1     1       1     1     1     1     1     1     1
#>  3 4844        1     1      1     1       1     1     1     1     1     1     1
#>  4 4845        1     1      1     1       1    NA    NA    NA    NA    NA    NA
#>  5 4847        1     1      1    NA      NA    NA    NA    NA    NA    NA    NA
#>  6 4848        1     1      1     1      NA    NA    NA    NA    NA    NA    NA
#>  7 4849        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
#>  8 4850        1     1     NA     1       1     1     1    NA    NA    NA    NA
#>  9 4851        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
#> 10 4854        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
#> 11 4855        1     1      1     1       1    NA    NA    NA    NA    NA    NA
#> 12 4857        1     1      1     1       1     1     1     1     1    NA    NA
#> 13 4858        1     1      1     1       1     1     1     1     1     1     1
#> 14 4859        1     1      1     1       1    NA    NA    NA    NA    NA    NA
#> 15 4861        1     1      1     1       1     1     1     1     1     1     1
#> 16 4862        1     1      1     1       1     1     1     1     1    NA    NA
#> 17 4863        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
#> 18 4864        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
#> 19 4865        1     1      1    NA      NA    NA    NA    NA    NA    NA    NA

## tidyfst way:
fish_encounters %>% 
  wider_dt(name = "station",value = "seen")
#> Key: <fish>
#>       fish Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2
#>     <fctr>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int>
#>  1:   4842       1     1      1     1       1     1     1     1     1
#>  2:   4843       1     1      1     1       1     1     1     1     1
#>  3:   4844       1     1      1     1       1     1     1     1     1
#>  4:   4845       1     1      1     1       1    NA    NA    NA    NA
#>  5:   4847       1     1      1    NA      NA    NA    NA    NA    NA
#>  6:   4848       1     1      1     1      NA    NA    NA    NA    NA
#>  7:   4849       1     1     NA    NA      NA    NA    NA    NA    NA
#>  8:   4850       1     1     NA     1       1     1     1    NA    NA
#>  9:   4851       1     1     NA    NA      NA    NA    NA    NA    NA
#> 10:   4854       1     1     NA    NA      NA    NA    NA    NA    NA
#> 11:   4855       1     1      1     1       1    NA    NA    NA    NA
#> 12:   4857       1     1      1     1       1     1     1     1     1
#> 13:   4858       1     1      1     1       1     1     1     1     1
#> 14:   4859       1     1      1     1       1    NA    NA    NA    NA
#> 15:   4861       1     1      1     1       1     1     1     1     1
#> 16:   4862       1     1      1     1       1     1     1     1     1
#> 17:   4863       1     1     NA    NA      NA    NA    NA    NA    NA
#> 18:   4864       1     1     NA    NA      NA    NA    NA    NA    NA
#> 19:   4865       1     1      1    NA      NA    NA    NA    NA    NA
#> 2 variables not shown: [MAE <int>, MAW <int>]

# if no keeped groups are selected, use all except for name and value columns

If you want to fill with 0s, use:

fish_encounters %>% 
  wider_dt(name = "station",value = "seen",fill = 0)
#> Key: <fish>
#>       fish Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2
#>     <fctr>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int>
#>  1:   4842       1     1      1     1       1     1     1     1     1
#>  2:   4843       1     1      1     1       1     1     1     1     1
#>  3:   4844       1     1      1     1       1     1     1     1     1
#>  4:   4845       1     1      1     1       1     0     0     0     0
#>  5:   4847       1     1      1     0       0     0     0     0     0
#>  6:   4848       1     1      1     1       0     0     0     0     0
#>  7:   4849       1     1      0     0       0     0     0     0     0
#>  8:   4850       1     1      0     1       1     1     1     0     0
#>  9:   4851       1     1      0     0       0     0     0     0     0
#> 10:   4854       1     1      0     0       0     0     0     0     0
#> 11:   4855       1     1      1     1       1     0     0     0     0
#> 12:   4857       1     1      1     1       1     1     1     1     1
#> 13:   4858       1     1      1     1       1     1     1     1     1
#> 14:   4859       1     1      1     1       1     0     0     0     0
#> 15:   4861       1     1      1     1       1     1     1     1     1
#> 16:   4862       1     1      1     1       1     1     1     1     1
#> 17:   4863       1     1      0     0       0     0     0     0     0
#> 18:   4864       1     1      0     0       0     0     0     0     0
#> 19:   4865       1     1      1     0       0     0     0     0     0
#> 2 variables not shown: [MAE <int>, MAW <int>]

Note that the parameter of name and value should always be provided and should be explicit called (with the parameter names attached).

More complicated example

This example comes from data.table (https://rdatatable.gitlab.io/data.table/articles/datatable-reshape.html), and has been used in tidyr too. We’ll try to do it in tidyfst in this example. If we have a data.frame as below:

family <- fread("family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2 gender_child3
1         30 1998-11-26 2000-01-29         NA             1             2            NA
2         27 1996-06-22         NA         NA             2            NA            NA
3         26 2002-07-11 2004-04-05 2007-09-02             2             2             1
4         32 2004-10-10 2009-08-27 2012-07-21             1             1             1
5         29 2000-12-05 2005-02-28         NA             2             1            NA")

family
#>    family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1
#>        <int>      <int>     <IDat>     <IDat>     <IDat>         <int>
#> 1:         1         30 1998-11-26 2000-01-29       <NA>             1
#> 2:         2         27 1996-06-22       <NA>       <NA>             2
#> 3:         3         26 2002-07-11 2004-04-05 2007-09-02             2
#> 4:         4         32 2004-10-10 2009-08-27 2012-07-21             1
#> 5:         5         29 2000-12-05 2005-02-28       <NA>             2
#> 2 variables not shown: [gender_child2 <int>, gender_child3 <int>]

And want to reshape the data.table to be like this:

#     family_id age_mother  child        dob gender
#         <int>      <int> <char>     <char> <char>
#  1:         1         30 child1 1998-11-26      1
#  2:         1         30 child2 2000-01-29      2
#  3:         1         30 child3       <NA>   <NA>
#  4:         2         27 child1 1996-06-22      2
#  5:         2         27 child2       <NA>   <NA>
#  6:         2         27 child3       <NA>   <NA>
#  7:         3         26 child1 2002-07-11      2
#  8:         3         26 child2 2004-04-05      2
#  9:         3         26 child3 2007-09-02      1
# 10:         4         32 child1 2004-10-10      1
# 11:         4         32 child2 2009-08-27      1
# 12:         4         32 child3 2012-07-21      1
# 13:         5         29 child1 2000-12-05      2
# 14:         5         29 child2 2005-02-28      1
# 15:         5         29 child3       <NA>   <NA>

The data.table::dcast and tidyr::pivot_longer could transfer it in one step, however, not so easy to understand. Here we’ll do it step by step to see what actually happens in this transfer.

family %>% 
  longer_dt(1:2) %>% 
  separate_dt("name",into = c("class","child")) %>% 
  wider_dt(-"class|value",
           name = "class",
           value = "value")
#> Key: <family_id, age_mother, child>
#>     family_id age_mother  child        dob     gender
#>         <int>      <int> <char>     <IDat>     <IDat>
#>  1:         1         30 child1 1998-11-26 1970-01-02
#>  2:         1         30 child2 2000-01-29 1970-01-03
#>  3:         1         30 child3       <NA>       <NA>
#>  4:         2         27 child1 1996-06-22 1970-01-03
#>  5:         2         27 child2       <NA>       <NA>
#>  6:         2         27 child3       <NA>       <NA>
#>  7:         3         26 child1 2002-07-11 1970-01-03
#>  8:         3         26 child2 2004-04-05 1970-01-03
#>  9:         3         26 child3 2007-09-02 1970-01-02
#> 10:         4         32 child1 2004-10-10 1970-01-02
#> 11:         4         32 child2 2009-08-27 1970-01-02
#> 12:         4         32 child3 2012-07-21 1970-01-02
#> 13:         5         29 child1 2000-12-05 1970-01-03
#> 14:         5         29 child2 2005-02-28 1970-01-02
#> 15:         5         29 child3       <NA>       <NA>

In such a process, we could find that we actually get a longer table, then separate it, and wider it later. tidyfst is not going to support the complicated transfer in one step, because it might be easier to implement, but much harder to understand 3 procedures in 1 step. If you still prefer that way, use data.table::dcast and tidyr::pivot_longer instead.