Abstract
groupdata2 is a set of methods for easy grouping, windowing, folding, partitioning, splitting and balancing of data. Create balanced folds for cross-validation or divide a time series into windows. Balance group sizes with up- and downsampling or collapse existing groups to fewer, balanced groups.
This vignette contains descriptions of functions and methods, along with simple examples of usage. For a gentler introduction to groupdata2, please see Introduction to groupdata2.
Contact the author at r-pkgs@ludvigolsen.dk
You can either install the CRAN version or the GitHub development version.
# Uncomment:
# install.packages("groupdata2")
# Uncomment:
# install.packages("devtools")
# devtools::install_github("LudvigOlsen/groupdata2")
# Attaching groupdata2
library(groupdata2)
# Attaching other packages used in this vignette
library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
# We will also be using plyr a few times, but we don't attach this
# because of possible conflicts with dplyr. Instead we use its functions
# like so: plyr::count()
groupdata2 is a set of functions and methods for easy grouping, windowing, folding, partitioning, splitting and balancing of data.
There are 7 main functions:
group_factor()
Returns a factor with group numbers, e.g. c(1, 1, 1, 2, 2, 2, 3, 3, 3). This can be used to subset, aggregate, group_by, etc.
group()
Returns the given data as a data frame with an added grouping factor made with group_factor(). The data frame is grouped by the grouping factor for easy use with %>% pipelines.
splt()
Splits the given data into the specified groups made with group_factor() and returns them in a list.
fold()
Creates (optionally) balanced folds for use in cross-validation. Balance folds on one categorical variable and/or one numerical variable. Ensure that all datapoints sharing an ID are in the same fold. Can create multiple unique fold columns at once, e.g. for repeated cross-validation.
partition()
Creates (optionally) balanced partitions (e.g. training/test sets). Balance partitions on one categorical variable and/or one numerical variable. Make sure that all datapoints sharing an ID are in the same partition.
collapse_groups()
Collapses existing groups into a smaller set of groups with (optional) categorical, numerical, ID, and/or size balancing.
balance()
Uses up- or downsampling to fix the size of all groups to the min, max, mean, or median group size or to a specific number of rows. Balancing can also happen on the ID level, e.g. to ensure the same number of IDs in each category.
When working with time series we would often refer to the kind of groups made by group_factor(), group() and splt() as windows. In this vignette, these will be referred to as groups.
fold() creates balanced groups for cross-validation by using group(). These are referred to as folds.
partition() creates balanced groups of (potentially) different sizes, by using group(). These are referred to as partitions.
Cross-validation with groupdata2
In this vignette, we go through the basics of cross-validation, such as creating balanced train/test sets with partition() and balanced folds with fold(). We also write up a simple cross-validation function and compare multiple linear regression models.
Time series with groupdata2
In this vignette, we divide up a time series into groups (windows) and subgroups using group() with the greedy and staircase methods. We do some basic descriptive stats of each group and use them to reduce the data size.
Automatic groups with groupdata2
In this vignette, we will use the l_starts method with group() to allow transferring of information from one dataset to another. We will use the automatic grouping function that finds group starts all by itself.
In the examples, we will be using knitr::kable() to visualize some of the data such as data frames. You do not need to use kable() when using the functions in groupdata2.
There are currently 10 methods for grouping the data.
It is possible to create groups based on number of groups (default), group size, list of group sizes, list of group start positions, distance between members, step size or prime number to start at. These can be passed as whole number(s) or percentage(s), while method l_starts also accepts 'auto'.
Here, we will take a look at the different methods.
greedy uses group size for dividing up the data.
Greedy refers to each group grabbing as many elements as possible (up to the specified size), meaning that there might be fewer elements available to the last group, but that all groups other than the last are guaranteed to have the size specified.
Example
We have a vector with 57 values. We want to have group sizes of 10.
The greedy splitter will return groups with this many values in them:
10, 10, 10, 10, 10, 7
By setting force_equal = TRUE, we discard the last group if it contains fewer values than the other groups.
Example
We have a vector with 57 values. We want to have group sizes of 10.
The greedy splitter with force_equal set to TRUE will return groups with this many values in them:
10, 10, 10, 10, 10
meaning that 7 values have been discarded.
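The arithmetic behind the greedy example can be sketched in base R. This is only an illustration of the sizes described above, not groupdata2's implementation; greedy_sizes() is a hypothetical helper name. With the package attached, the corresponding call would be group_factor(1:57, 10, method = 'greedy').

```r
# Hypothetical helper: group sizes produced by a greedy split.
greedy_sizes <- function(len, size, force_equal = FALSE) {
  n_full <- len %/% size      # number of complete groups
  remainder <- len %% size    # elements left over for the last group
  sizes <- rep(size, n_full)
  if (remainder > 0 && !force_equal) {
    sizes <- c(sizes, remainder)  # the smaller last group is kept
  }
  sizes
}

greedy_sizes(57, 10)                      # 10 10 10 10 10 7
greedy_sizes(57, 10, force_equal = TRUE)  # 10 10 10 10 10
```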
n_dist uses a specified number of groups to divide up the data. First, it creates equal groups as large as possible. Then, it distributes any excess data points across the groups.
Example
We have a vector with 57 values. We want to get back 5 groups.
'n_dist' with default settings would return groups with this many values in them:
11, 11, 12, 11, 12
By setting force_equal = TRUE, n_dist will create the largest possible, equally sized groups by discarding excess data elements.
Example
'n_dist' with force_equal set to TRUE would return groups with this many values in them:
11, 11, 11, 11, 11
meaning that 2 values have been discarded.
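One way to reproduce these sizes in base R is to scale each position to the number of groups and round up. This is a sketch that matches the example above; we do not claim it is how groupdata2 computes the factor internally, and n_dist_sizes() is a hypothetical helper name.

```r
# Hypothetical helper: sizes of n_groups groups with excess spread out.
n_dist_sizes <- function(len, n_groups, force_equal = FALSE) {
  if (force_equal) {
    len <- len - len %% n_groups  # drop the excess elements (assumed: from the end)
  }
  # Multiply before dividing so exact multiples stay exact in floating point
  ids <- ceiling(seq_len(len) * n_groups / len)
  tabulate(ids, nbins = n_groups)
}

n_dist_sizes(57, 5)                      # 11 11 12 11 12
n_dist_sizes(57, 5, force_equal = TRUE)  # 11 11 11 11 11
```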
n_fill uses a specified number of groups to divide up the data.
It creates equal groups as large as possible and places any excess data points in the first groups. By setting descending = TRUE, it would fill from the last groups instead.
Example
We have a vector with 57 values. We want to get back 5 groups.
'n_fill' with default settings would return groups with this many values in them:
12, 12, 11, 11, 11
By setting force_equal = TRUE, n_fill will create the largest possible, equally sized groups by discarding excess data elements.
Example
'n_fill' with force_equal set to TRUE would return groups with this many values in them:
11, 11, 11, 11, 11
meaning that 2 values have been discarded.
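The filling rule can be illustrated with a few lines of base R. This is a sketch of the sizes only, under the assumption that each filled group receives exactly one extra element; n_fill_sizes() is a hypothetical helper name and not part of groupdata2.

```r
# Hypothetical helper: equal base size, excess placed in the first
# (or, with descending = TRUE, the last) groups.
n_fill_sizes <- function(len, n_groups, descending = FALSE) {
  base_size <- len %/% n_groups
  excess <- len %% n_groups
  extra <- c(rep(1, excess), rep(0, n_groups - excess))
  if (descending) {
    extra <- rev(extra)  # fill from the last groups instead
  }
  base_size + extra
}

n_fill_sizes(57, 5)                     # 12 12 11 11 11
n_fill_sizes(57, 5, descending = TRUE)  # 11 11 11 12 12
```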
n_last uses a specified number of groups to divide up the data.
With default settings, it tries to make the groups as equally sized as possible, but notice that the last group might contain fewer or more elements if the length of the data is not divisible by the number of groups. All groups, except the last, are guaranteed to contain the same number of elements.
Example
We have a vector with 57 values. We want to get back 5 groups.
'n_last' with default settings would return groups with this many values in them:
11, 11, 11, 11, 13
By setting force_equal = TRUE, n_last will create the largest possible, equally sized groups by discarding excess data elements.
Example
'n_last' with force_equal set to TRUE would return groups with this many values in them:
11, 11, 11, 11, 11
meaning that 2 values have been discarded.
Notice that n_last will always return the given number of groups. It will never return a group with zero elements. In some situations, that means that the last group will contain a lot of elements. Asked to divide a vector with 57 elements into 20 groups, the first 19 groups will contain 2 elements each, while the last group will itself contain 19 elements. Had we instead asked it to divide the vector into 19 groups, we would have had 3 elements in all groups.
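The three n_last cases above (5, 20 and 19 groups) can be checked with a small base-R sketch. It assumes the common group size is the length divided by the number of groups, rounded down, which reproduces these examples; groupdata2 may choose the size differently in other cases. n_last_sizes() is a hypothetical helper name.

```r
# Hypothetical helper: all groups but the last share one size,
# and the last group takes whatever remains.
n_last_sizes <- function(len, n_groups) {
  base_size <- len %/% n_groups
  c(rep(base_size, n_groups - 1), len - base_size * (n_groups - 1))
}

n_last_sizes(57, 5)   # 11 11 11 11 13
n_last_sizes(57, 20)  # nineteen groups of 2, then one group of 19
n_last_sizes(57, 19)  # nineteen groups of 3
```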
n_rand uses a specified number of groups to divide up the data.
It creates equal groups as large as possible and places any excess data points randomly in the groups (max. one extra element per group).
Example
We have a vector with 57 values. We want to get back 5 groups.
'n_rand' with default settings could return groups with this many values in them:
12, 11, 11, 11, 12
By setting force_equal = TRUE, n_rand will create the largest possible, equally sized groups by discarding excess data elements.
Example
'n_rand' with force_equal set to TRUE would return groups with this many values in them:
11, 11, 11, 11, 11
meaning that 2 values have been discarded.
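The random placement can be sketched in base R by sampling which groups receive one extra element. This is an illustration under that assumption, not groupdata2's code; n_rand_sizes() is a hypothetical helper name, and the result depends on the random seed.

```r
# Hypothetical helper: equal base size, with each excess element given
# to a different randomly chosen group (at most one extra per group).
n_rand_sizes <- function(len, n_groups) {
  base_size <- len %/% n_groups
  excess <- len %% n_groups
  sizes <- rep(base_size, n_groups)
  lucky <- sample(n_groups, excess)  # groups that get one extra element
  sizes[lucky] <- sizes[lucky] + 1
  sizes
}

set.seed(1)
sizes <- n_rand_sizes(57, 5)
sizes  # three groups of 11 and two groups of 12, in random positions
```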
l_sizes divides up the data by a list of group sizes.
Excess data points are placed in an extra group at the end.
n is a list/vector of group sizes.
Example
We have a vector with 57 values. We want to get back 3 groups containing 20%, 30% and 50% of the data points.
'l_sizes' with n = c(0.2, 0.3) would return groups with this many values in them:
11, 17, 29
By setting force_equal = TRUE, l_sizes discards any excess elements.
Example
'l_sizes' with n = c(0.2, 0.3) and force_equal set to TRUE would return groups with this many values in them:
11, 17
meaning that 29 values have been discarded.
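The sizes in this example follow from multiplying the length by each percentage and rounding down. The rounding direction is our assumption; it matches the numbers above (11.4 becomes 11 and 17.1 becomes 17). l_sizes_sizes() is a hypothetical helper name.

```r
# Hypothetical helper: group sizes from a vector of percentages,
# with the excess placed in an extra group at the end.
l_sizes_sizes <- function(len, fractions, force_equal = FALSE) {
  sizes <- floor(len * fractions)
  excess <- len - sum(sizes)
  if (excess > 0 && !force_equal) {
    sizes <- c(sizes, excess)
  }
  sizes
}

l_sizes_sizes(57, c(0.2, 0.3))                      # 11 17 29
l_sizes_sizes(57, c(0.2, 0.3), force_equal = TRUE)  # 11 17
```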
l_starts starts new groups at specified values of the vector.
n is a list of starting positions. Skip values by c(value, skip_to_number) where skip_to_number is the nth appearance of the value in the vector. Groups automatically start from the first data point.
If passing n = 'auto', the starting positions are automatically found with find_starts().
If data is a data frame, starts_col must be set to indicate the column to match starts.
Set starts_col to 'index' or '.index' for matching with row names. 'index' first looks for a column named 'index' in data, while '.index' completely ignores any potential column in data named '.index'.
Example
We have a vector with 57 values ranging from 1 to 57 (1:57). We want to get back groups starting at specific values in the vector.
'l_starts' with n = c(1, 3, 7, 25, 50) would return groups with this many values in them:
2, 4, 18, 25, 8
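For this simple case, where every start value occurs exactly once in the vector, the grouping can be reproduced in base R by counting how many starts have been passed at each position. This is a sketch of the idea only; it does not handle repeated values or skip-to numbers the way the l_starts method does.

```r
v <- 1:57
starts <- c(1, 3, 7, 25, 50)

# A new group begins whenever the current value is one of the starts
group_ids <- cumsum(v %in% starts)

tabulate(group_ids)  # 2 4 18 25 8
```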
force_equal does not have any effect with method l_starts.
Groups can start at the nth appearance of the value by using c(value, skip_to_number).
Example
We have a vector with the values c("a", "e", "o", "a", "e", "o") and want to start groups at the first "a", the first following "e" and the second following "o".
'l_starts' with n = list("a", "e", c("o", 2)) would return groups with this many values in them:
1, 4, 1
Using the find_starts() function, l_starts is capable of finding the beginning of groups automatically.
A group start is a value which differs from the previous value.
Example
We have a vector with the values c("a", "a", "o", "o", "o", "a", "a") and want to automatically discover groups of data and group them.
'l_starts' with n = 'auto' would return groups with this many values in them:
2, 3, 2
find_starts() finds group starts in a given vector.
A group start is a value which differs from the previous value.
Setting return_index = TRUE returns indices of group starts.
Example
We have a vector with the values c("a", "a", "o", "o", "o", "a", "a") and want to automatically discover group starts.
find_starts() would return these group starts:
"a", "o", "a"
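The definition of a group start (a value that differs from the previous value) is the same idea as run-length encoding, so base R's rle() can be used to check the example by hand. This is an independent illustration and not a claim about how find_starts() is implemented.

```r
v <- c("a", "a", "o", "o", "o", "a", "a")

runs <- rle(v)   # consecutive runs of identical values
runs$values      # "a" "o" "a"  (the group starts)
runs$lengths     # 2 3 2        (the group sizes with n = 'auto')

# Index of the first element of each run
start_idx <- cumsum(c(1, head(runs$lengths, -1)))
start_idx        # 1 3 6
```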
find_missing_starts() tells you the values and (optionally) skip-to numbers that would be recursively removed when using the l_starts method with the remove_missing_starts argument set to TRUE.
Set return_skip_numbers = FALSE to get only the missing values without the skip-to numbers.
Example
We have a vector with the values c("a", "a", "o", "o", "o", "a", "a") and a vector of starting positions c("a", "d", "o", "p", "a").
find_missing_starts() would return this list of values and skip_to numbers:
list(c("d", 1), c("p", 1))
every greedily combines every nth data point.
Example
We have a vector with 57 values. We want to get back 5 groups.
'every' with default settings would return groups with this many values in them:
12, 12, 11, 11, 11
By setting force_equal = TRUE, every discards the last length %% n data points (the remainder when dividing the length by n) such that all groups have the same size.
Example
'every' with force_equal set to TRUE would return groups with this many values in them:
11, 11, 11, 11, 11
meaning that 2 values have been discarded.
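Combining every nth data point amounts to cycling through the group numbers 1 to n. A base-R sketch of the resulting sizes, where every_sizes() is a hypothetical helper name and not a groupdata2 function:

```r
# Hypothetical helper: element i goes to group ((i - 1) mod n) + 1.
every_sizes <- function(len, n, force_equal = FALSE) {
  if (force_equal) {
    len <- len - len %% n  # discard the last len %% n data points
  }
  ids <- (seq_len(len) - 1) %% n + 1
  tabulate(ids, nbins = n)
}

every_sizes(57, 5)                      # 12 12 11 11 11
every_sizes(57, 5, force_equal = TRUE)  # 11 11 11 11 11
```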
staircase uses a step size to divide up the data.
For each group, the group size will be the step size multiplied by the group index.
Example
We have a vector with 57 values. We specify a step size of 5.
'staircase' with default settings would return groups with this many values in them:
5, 10, 15, 20, 7
By setting force_equal = TRUE, staircase will discard the last group if it does not contain the expected number of values (step size multiplied by group index).
Example
'staircase' with force_equal set to TRUE would return groups with this many values in them:
5, 10, 15, 20
meaning that 7 values have been discarded.
When using the staircase method, the last group might not have the size of the second last group + step size. Use %staircase% to find the remainder.
If the last group has the size of the second last group + step size, %staircase% will return 0.
Example
%staircase% on a vector with size 57 and step size of 5 would look like this:
57 %staircase% 5
and return:
7
meaning that the last group would contain 7 values.
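The staircase sizes and the remainder can be worked out with a short loop in base R. This sketch only mirrors the description above; staircase_sizes() and staircase_remainder() are hypothetical helper names, while the package's own operator for the remainder is %staircase%.

```r
# Hypothetical helper: group k takes step * k elements, or what is left.
staircase_sizes <- function(len, step, force_equal = FALSE) {
  sizes <- c()
  k <- 1
  while (len > 0) {
    take <- min(step * k, len)
    sizes <- c(sizes, take)
    len <- len - take
    k <- k + 1
  }
  last <- length(sizes)
  if (force_equal && sizes[last] != step * last) {
    sizes <- sizes[-last]  # drop the incomplete last group
  }
  sizes
}

# Hypothetical helper: size of an incomplete last group, or 0.
staircase_remainder <- function(len, step) {
  sizes <- staircase_sizes(len, step)
  last <- length(sizes)
  if (sizes[last] == step * last) 0 else sizes[last]
}

staircase_sizes(57, 5)                      # 5 10 15 20 7
staircase_sizes(57, 5, force_equal = TRUE)  # 5 10 15 20
staircase_remainder(57, 5)                  # 7
staircase_remainder(50, 5)                  # 0
```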
primes creates groups with sizes of prime numbers in a staircasing design.
n is the prime number to start at (size of first group).
Prime numbers are generated with the numbers package by Hans Werner Borchers.
Example
We have a vector with 57 values. We specify n (start at) as 5.
'primes' with default settings would return groups with this many values in them:
5, 7, 11, 13, 17, 4
By setting force_equal = TRUE, primes will discard the last group if it does not contain the expected number of values.
Example
'primes' with force_equal set to TRUE would return groups with this many values in them:
5, 7, 11, 13, 17
meaning that 4 values have been discarded.
When using the primes method, the last group might not have the size of the associated prime number, if there are not enough elements. Use %primes% to find the remainder.
Returns 0 if the last group has the size of the associated prime number.
Example
%primes% on a vector with size 57 and n (start at) as 5 would look like this:
57 %primes% 5
and return:
4
meaning that the last group would contain 4 values.
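The prime-sized groups can be reproduced with a small base-R sketch. To keep it self-contained, it uses a simple trial-division check instead of the numbers package mentioned above; is_prime() and primes_sizes() are hypothetical helper names and not part of groupdata2.

```r
# Hypothetical helper: trial-division primality check.
is_prime <- function(x) {
  if (x < 2) return(FALSE)
  if (x < 4) return(TRUE)          # 2 and 3
  all(x %% 2:floor(sqrt(x)) != 0)
}

# Hypothetical helper: consecutive primes from `start` as group sizes,
# with the last group taking whatever is left.
primes_sizes <- function(len, start) {
  sizes <- c()
  candidate <- start
  while (len > 0) {
    if (is_prime(candidate)) {
      take <- min(candidate, len)
      sizes <- c(sizes, take)
      len <- len - take
    }
    candidate <- candidate + 1
  }
  sizes
}

primes_sizes(57, 5)  # 5 7 11 13 17 4
```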
There are currently 4 methods for balancing on ID level in balance().
n_ids
Balances on ID level only. It makes sure there are the same number of IDs in each category. This might lead to a different number of rows between categories.
n_rows_c
Attempts to level the number of rows per category, while only removing/adding entire IDs.
This is done in 2 steps:
1. If a category needs to add all its rows one or more times, the data is repeated.
2. Iteratively, the ID with the number of rows closest to the lacking/excessive number of rows is added/removed. This happens until adding/removing the closest ID would lead to a size further from the target size than the current size. If multiple IDs are closest, one is randomly sampled.
distributed
Distributes the lacking/excess rows equally between the IDs. If the number to distribute cannot be equally divided, some IDs will have 1 row more/less than the others.
nested
Balances the IDs within their categories, meaning that all IDs in a category will have the same number of rows.
These are the arguments for group_factor(), group(), splt(), fold(), partition()
data
Type: data frame or vector
The data to process.
Used in: group_factor(), group(), splt(), fold(), partition()
n
Type: integer, numeric, character, or list
n represents either number of groups (default), group size, list of group sizes, list of group starts, step size or prime number to start at, depending on which method is specified.
n can be given as whole number(s) (n > 1) or as percentage(s) (0 < n < 1).
Method l_starts allows n = 'auto'.
Used in: group_factor(), group(), splt()
method
Type: character
Choose which method to use when dividing up the data.
Available methods: greedy, n_dist, n_fill, n_last, n_rand, l_sizes, l_starts, every, staircase, or primes
Used in: group_factor(), group(), splt(), fold()
starts_col
Type: character
Name of column with values to match in method l_starts when data is a data frame.
Pass 'index' or '.index' to use row names. 'index' first looks for a column named 'index' in data, while '.index' completely ignores any potential column in data named '.index'.
Used in: group_factor(), group(), splt()
force_equal
Type: logical (TRUE or FALSE)
If you need groups with the exact same size, set force_equal = TRUE.
The implementation differs between the methods. Read more in their sections above.
Be aware that this setting discards excess datapoints!
Used in: group_factor(), group(), splt(), partition()
allow_zero
Type: logical (TRUE or FALSE)
If you set n to 0, you get an error.
If you don't want this behavior, you can set allow_zero = TRUE, and (depending on the function) you will get the following output:
group_factor() will return the factor with NAs instead of numbers. It will be the same length as expected.
group() will return the expected data frame with NAs instead of a grouping factor.
splt() will return the given data (data frame or vector) in the same list format as if it had been split.
Used in: group_factor(), group(), splt()
descending
Type: logical (TRUE or FALSE)
In methods like n_fill where it makes sense to change the direction of the method, you can use this argument.
In n_fill it fills up the excess data points starting from the last group instead of the first.
NB. Only some of the methods can use this argument.
Used in: group_factor(), group(), splt()
randomize
Type: logical (TRUE or FALSE)
After creating the grouping factor using the chosen method, it is possible to randomly reorganize it before returning it. Notice that this applies to all the functions that allow for the argument, as group() and splt() use the grouping factor!
Used in: group_factor(), group(), splt()
N.B. fold() and partition() always use some randomization.
col_name
Type: character
Name of added grouping factor column. Allows multiple grouping factors in a data frame.
Used in: group()
remove_missing_starts
Type: logical (TRUE or FALSE)
Recursively remove elements from the list of starts that are not found. For method l_starts only.
Used in: group_factor(), group(), splt()
k
Type: integer or numeric
k represents either number of folds (default), fold size, or step size, depending on which method is specified.
k can be given as a whole number (k > 1) or as a percentage (0 < k < 1).
Used in: fold()
p
Type: integer or numeric
Size(s) of partition(s). Passed as vector if specifying multiple partitions.
p can be given as whole number(s) (p > 1) or as percentage(s) (0 < p < 1).
Used in: partition()
cat_col
Type: categorical vector or factor (passed as column name)
Categorical variable to balance between the groups.
E.g. when predicting a binary variable ('a' or 'b'), we usually want both classes represented in every fold and partition.
N.B. If also passing id_col, cat_col should be a constant within IDs. E.g. a participant must always have the same diagnosis ('a' or 'b') throughout the dataset. Otherwise, the participant might be placed in multiple folds.
Used in: fold(), partition()
num_col
Type: numerical vector (passed as column name)
Numerical variable to balance between groups.
N.B. When used with id_col, values for each ID are aggregated using id_aggregation_fn before being balanced.
N.B. When passing num_col, the method argument is not used.
Used in: fold(), partition()
id_col
Type: factor (passed as column name)
factor with IDs. This will be used to keep all rows with an ID in the same group (if possible).
E.g. If we have measured a participant multiple times and want to see the effect of time, we want to have all observations of this participant in the same fold/partition.
Used in: fold(), partition()
id_aggregation_fn
Type: function
Function for aggregating values in num_col for each ID, before balancing by num_col.
N.B. Only used when num_col and id_col are both specified.
Used in: fold(), partition()
extreme_pairing_levels
Type: integer or numeric
How many levels of extreme pairing to do when balancing groups by num_col.
Extreme pairing: Rows/pairs are ordered as smallest, largest, second smallest, second largest, etc. If extreme_pairing_levels > 1, this is done "recursively" on the extreme pairs.
N.B. Values greater than 1 work best with large datasets. Always check if an increase actually makes the groups more balanced. There are examples of how to do this, and more detailed descriptions of the implementations, in the functions' help files (?fold and ?partition).
Used in: fold(), partition()
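The ordering used in extreme pairing (smallest, largest, second smallest, second largest, etc.) can be illustrated for a single level with base R. This sketch shows only the ordering step; how groupdata2 then forms the pairs and assigns them to folds or partitions is not shown, and extreme_order() is a hypothetical helper name.

```r
# Hypothetical helper: reorder values as smallest, largest,
# second smallest, second largest, and so on.
extreme_order <- function(x) {
  s <- sort(x)
  n <- length(s)
  lo <- seq_len(ceiling(n / 2))                 # positions from the low end
  hi <- rev(seq_len(n))[seq_len(floor(n / 2))]  # positions from the high end
  idx <- as.vector(rbind(lo[seq_along(hi)], hi))  # interleave low and high
  if (n %% 2 == 1) {
    idx <- c(idx, lo[length(lo)])  # the middle value has no partner
  }
  s[idx]
}

extreme_order(c(7, 1, 9, 3, 5, 2))  # 1 9 2 7 3 5
extreme_order(1:5)                  # 1 5 2 4 3
```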
num_fold_cols
Type: integer or numeric
Number of fold columns to create. This is useful for repeated cross-validation. If num_fold_cols > 1, columns will be named ".folds_1", ".folds_2", etc. Otherwise simply ".folds".
N.B. If unique_fold_cols_only is TRUE, we can end up with fewer columns than specified, see max_iters.
N.B. If data has existing fold columns, see handle_existing_fold_cols.
Used in: fold()
unique_fold_cols_only
Type: logical (TRUE or FALSE)
Check if the fold columns are identical and keep only the unique columns.
N.B. As the number of column comparisons can be time consuming, we can run this part in parallel. See parallel.
N.B. We can end up with fewer columns than specified in num_fold_cols, see max_iters.
N.B. Only used when num_fold_cols > 1 or data has existing fold columns.
Used in: fold()
max_iters
Type: integer or numeric
Maximum number of attempts at reaching num_fold_cols unique fold columns.
When only keeping the unique fold columns, we risk having fewer columns than expected. Hence, we repeatedly create the missing columns and remove those that are not unique. This is done until we have num_fold_cols unique fold columns, or we have attempted max_iters times. In some cases, it is not possible to create num_fold_cols unique combinations of the dataset, e.g. when specifying cat_col, id_col and num_col. max_iters specifies when to stop trying.
N.B. We can end up with fewer columns than specified in num_fold_cols.
N.B. Only used when num_fold_cols > 1.
Used in: fold()
handle_existing_fold_cols
Type: character
How to handle existing fold columns. Either "keep_warn", "keep", or "remove".
To add extra fold columns, use "keep" or "keep_warn". Note that existing fold columns might be renamed.
To replace the existing fold columns, use "remove".
Used in: fold()
parallel
Type: logical (TRUE or FALSE)
Whether to parallelize the fold column comparisons, when unique_fold_cols_only is TRUE.
N.B. Requires a registered parallel backend, e.g. one registered with doParallel::registerDoParallel.
Used in: fold()
list_out
Type: logical (TRUE or FALSE)
Return list of partitions (TRUE) or a grouped data frame (FALSE).
Used in: partition()
These are the arguments for balance(), upsample(), downsample()
data
Type: data frame
The data to process.
Used in: balance(), upsample(), downsample()
size
Type: character, numeric
Size to balance categories to.
Either a specific number, given as a whole number, or one of the following strings: "min", "max", "mean", "median".
Used in: balance(), upsample(), downsample()
cat_col
Type: categorical vector or factor (passed as column name)
Categorical variable to balance sample size by.
Used in: balance(), upsample(), downsample()
id_col
Type: factor (passed as column name)
factor with IDs. IDs are considered entities, e.g. allowing us to add or remove all rows for an ID. How this is used is up to the id_method.
E.g. if we have measured a participant multiple times and want to make sure that we keep all these measurements together, we would either remove/add all measurements for the participant or leave in all measurements for the participant.
Used in: balance(), upsample(), downsample()
id_method
Type: character
Method for balancing the IDs.
Available ID methods are: n_ids, n_rows_c, distributed, or nested.
Used in: balance(), upsample(), downsample()
mark_new_rows
Type: logical
Whether to add a column with 1s for added rows, and 0s for original rows.
Used in: balance(), upsample(), downsample()
new_rows_col_name
Type: character
Name of column marking new rows.
Used in: balance(), upsample(), downsample()
We will be using the method n_dist on a data frame to showcase the functions. Afterwards we will use and compare the methods.
Note that you can also use vectors as data with all the functions.
See the necessary attached packages for running the examples under Attach Packages.
data frame
df <- data.frame(
  "x" = c(1:12),
  "species" = factor(rep(c('cat', 'pig', 'human'), 4)),
  "age" = sample(c(1:100), 12)
)
group_factor()
groups <- group_factor(df, 5, method = 'n_dist')
groups
#> [1] 1 1 2 2 3 3 3 4 4 5 5 5
#> Levels: 1 2 3 4 5
df$groups <- groups
df %>% kable(align = 'c')
x | species | age | groups |
---|---|---|---|
1 | cat | 63 | 1 |
2 | pig | 7 | 1 |
3 | human | 21 | 2 |
4 | cat | 18 | 2 |
5 | pig | 66 | 3 |
6 | human | 37 | 3 |
7 | cat | 73 | 3 |
8 | pig | 47 | 4 |
9 | human | 67 | 4 |
10 | cat | 91 | 5 |
11 | pig | 35 | 5 |
12 | human | 70 | 5 |
df %>%
  group_by(groups) %>%
  summarize(mean_age = mean(age)) %>%
  kable(align = 'c')
groups | mean_age |
---|---|
1 | 35.00000 |
2 | 19.50000 |
3 | 58.66667 |
4 | 57.00000 |
5 | 65.33333 |
Getting an equal number of elements per group with group_factor().
Notice that we discard the excess values so all groups contain the same number of elements. Since the grouping factor is shorter than the data frame, we can't combine them as they are. A way to do so would be to shorten the data frame to be the same length as the grouping factor.
data frame
df <- data.frame(
  "x" = c(1:12),
  "species" = factor(rep(c('cat', 'pig', 'human'), 4)),
  "age" = sample(c(1:100), 12)
)
group_factor() with force_equal
groups <- group_factor(df, 5, method = 'n_dist', force_equal = TRUE)
groups
#> [1] 1 1 2 2 3 3 4 4 5 5
#> Levels: 1 2 3 4 5
plyr::count(groups) %>%
  rename(group = x, size = freq) %>%
  kable(align = 'c')
group | size |
---|---|
1 | 2 |
2 | 2 |
3 | 2 |
4 | 2 |
5 | 2 |
data frame and grouping factor
First we make the data frame the same size as the grouping factor. Then we add the grouping factor to the data frame.
df <- head(df, length(groups)) %>%
  mutate(group = groups)
df %>% kable(align = 'c')
x | species | age | group |
---|---|---|---|
1 | cat | 94 | 1 |
2 | pig | 22 | 1 |
3 | human | 64 | 2 |
4 | cat | 13 | 2 |
5 | pig | 26 | 3 |
6 | human | 37 | 3 |
7 | cat | 2 | 4 |
8 | pig | 36 | 4 |
9 | human | 81 | 5 |
10 | cat | 31 | 5 |
data frame
df <- data.frame(
  "x" = c(1:12),
  "species" = factor(rep(c('cat', 'pig', 'human'), 4)),
  "age" = sample(c(1:100), 12)
)
group()
df_grouped <- group(df, 5, method = 'n_dist')
df_grouped %>% kable(align = 'c')
x | species | age | .groups |
---|---|---|---|
1 | cat | 50 | 1 |
2 | pig | 19 | 1 |
3 | human | 82 | 2 |
4 | cat | 65 | 2 |
5 | pig | 77 | 3 |
6 | human | 11 | 3 |
7 | cat | 69 | 3 |
8 | pig | 39 | 4 |
9 | human | 76 | 4 |
10 | cat | 59 | 5 |
11 | pig | 71 | 5 |
12 | human | 100 | 5 |
2.2 Using group() in a magrittr pipeline to get mean age
df_means <- df %>%
  group(5, method = 'n_dist') %>%
  summarise(mean_age = mean(age))
df_means %>% kable(align = 'c')
.groups | mean_age |
---|---|
1 | 34.50000 |
2 | 73.50000 |
3 | 52.33333 |
4 | 57.50000 |
5 | 76.66667 |
Getting an equal number of elements per group with group().
Notice that we discard the excess rows/elements so all groups contain the same number of elements.
data frame
df <- data.frame(
  "x" = c(1:12),
  "species" = factor(rep(c('cat', 'pig', 'human'), 4)),
  "age" = sample(c(1:100), 12)
)
group() with force_equal
df_grouped <- df %>%
  group(5, method = 'n_dist', force_equal = TRUE)
df_grouped %>% kable(align = 'c')
x | species | age | .groups |
---|---|---|---|
1 | cat | 53 | 1 |
2 | pig | 79 | 1 |
3 | human | 3 | 2 |
4 | cat | 47 | 2 |
5 | pig | 71 | 3 |
6 | human | 66 | 3 |
7 | cat | 45 | 4 |
8 | pig | 81 | 4 |
9 | human | 41 | 5 |
10 | cat | 23 | 5 |
data frame
df <- data.frame(
  "x" = c(1:12),
  "species" = factor(rep(c('cat', 'pig', 'human'), 4)),
  "age" = sample(c(1:100), 12)
)
splt()
df_list <- splt(df, 5, method = 'n_dist')
df_list %>% kable(align = 'c')
splt() without using kable() to visualize it.
splt() uses base::split() to split the data by the grouping factor.
v <- c(1:6)
splt(v, 3, method = 'n_dist')
#> $`1`
#> [1] 1 2
#>
#> $`2`
#> [1] 3 4
#>
#> $`3`
#> [1] 5 6
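Since splt() is described as splitting by the grouping factor with base::split(), the same list can be built by hand in base R. In this sketch the grouping factor is constructed manually rather than with group_factor(), so it only illustrates the mechanism.

```r
v <- c(1:6)

# A hand-made grouping factor with 3 equally sized groups: 1 1 2 2 3 3
grouping_factor <- factor(ceiling(seq_along(v) * 3 / length(v)))

# base::split() returns one list element per factor level
v_list <- split(v, grouping_factor)
v_list
```

The list elements are named after the factor levels ("1", "2" and "3"), which is why the output above is printed with $`1`, $`2` and $`3`.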
Getting an equal number of elements per group with splt().
Notice that we discard the excess rows/elements so all groups contain the same number of elements.
data frame
df <- data.frame(
  "x" = c(1:12),
  "species" = factor(rep(c('cat', 'pig', 'human'), 4)),
  "age" = sample(c(1:100), 12)
)
splt() with force_equal
df_list <- splt(df, 5, method = 'n_dist', force_equal = TRUE)
df_list %>% kable(align = 'c')
data frame
df <- data.frame(
  "participant" = factor(rep(c('1', '2', '3', '4', '5', '6'), 3)),
  "age" = rep(sample(c(1:100), 6), 3),
  "diagnosis" = factor(rep(c('a', 'b', 'a', 'a', 'b', 'b'), 3)),
  "score" = sample(c(1:100), 3 * 6)
)
df <- df %>%
  arrange(participant)
# Remove index
rownames(df) <- NULL
# Add session info
df$session <- rep(c('1','2', '3'), 6)
kable(df, align = 'c')
participant | age | diagnosis | score | session |
---|---|---|---|---|
1 | 44 | a | 72 | 1 |
1 | 44 | a | 61 | 2 |
1 | 44 | a | 92 | 3 |
2 | 71 | b | 13 | 1 |
2 | 71 | b | 82 | 2 |
2 | 71 | b | 53 | 3 |
3 | 40 | a | 25 | 1 |
3 | 40 | a | 100 | 2 |
3 | 40 | a | 57 | 3 |
4 | 32 | a | 14 | 1 |
4 | 32 | a | 73 | 2 |
4 | 32 | a | 31 | 3 |
5 | 73 | b | 24 | 1 |
5 | 73 | b | 41 | 2 |
5 | 73 | b | 23 | 3 |
6 | 20 | b | 6 | 1 |
6 | 20 | b | 37 | 2 |
6 | 20 | b | 83 | 3 |
fold() without balancing
df_folded <- fold(df, 3, method = 'n_dist')
# Order by folds
df_folded <- df_folded %>%
  arrange(.folds)
kable(df_folded, align = 'c')
participant | age | diagnosis | score | session | .folds |
---|---|---|---|---|---|
1 | 44 | a | 61 | 2 | 1 |
1 | 44 | a | 92 | 3 | 1 |
4 | 32 | a | 73 | 2 | 1 |
4 | 32 | a | 31 | 3 | 1 |
5 | 73 | b | 24 | 1 | 1 |
6 | 20 | b | 83 | 3 | 1 |
1 | 44 | a | 72 | 1 | 2 |
2 | 71 | b | 13 | 1 | 2 |
3 | 40 | a | 100 | 2 | 2 |
4 | 32 | a | 14 | 1 | 2 |
5 | 73 | b | 41 | 2 | 2 |
6 | 20 | b | 6 | 1 | 2 |
2 | 71 | b | 82 | 2 | 3 |
2 | 71 | b | 53 | 3 | 3 |
3 | 40 | a | 25 | 1 | 3 |
3 | 40 | a | 57 | 3 | 3 |
5 | 73 | b | 23 | 3 | 3 |
6 | 20 | b | 37 | 2 | 3 |
fold() with balancing but without id_col
df_folded <- fold(df, 3, cat_col = 'diagnosis', method = 'n_dist')
# Order by folds
df_folded <- df_folded %>%
  arrange(.folds)
kable(df_folded, align = 'c')
participant | age | diagnosis | score | session | .folds |
---|---|---|---|---|---|
1 | 44 | a | 61 | 2 | 1 |
3 | 40 | a | 25 | 1 | 1 |
3 | 40 | a | 57 | 3 | 1 |
2 | 71 | b | 13 | 1 | 1 |
5 | 73 | b | 41 | 2 | 1 |
6 | 20 | b | 6 | 1 | 1 |
1 | 44 | a | 72 | 1 | 2 |
1 | 44 | a | 92 | 3 | 2 |
4 | 32 | a | 14 | 1 | 2 |
2 | 71 | b | 53 | 3 | 2 |
5 | 73 | b | 24 | 1 | 2 |
6 | 20 | b | 37 | 2 | 2 |
3 | 40 | a | 100 | 2 | 3 |
4 | 32 | a | 73 | 2 | 3 |
4 | 32 | a | 31 | 3 | 3 |
2 | 71 | b | 82 | 2 | 3 |
5 | 73 | b | 23 | 3 | 3 |
6 | 20 | b | 83 | 3 | 3 |
Let’s count how many of each diagnosis there are in each group.
df_folded %>%
  count(.folds, diagnosis) %>%
  kable(align='c')
.folds | diagnosis | n |
---|---|---|
1 | a | 3 |
1 | b | 3 |
2 | a | 3 |
2 | b | 3 |
3 | a | 3 |
3 | b | 3 |
fold() with id_col but without balancing
df_folded <- fold(df, 3, id_col = 'participant', method = 'n_dist')
# Order by folds
df_folded <- df_folded %>%
  arrange(.folds)
# Remove index (Looks prettier in the table!)
rownames(df_folded) <- NULL
kable(df_folded, align = 'c')
participant | age | diagnosis | score | session | .folds |
---|---|---|---|---|---|
3 | 40 | a | 25 | 1 | 1 |
3 | 40 | a | 100 | 2 | 1 |
3 | 40 | a | 57 | 3 | 1 |
5 | 73 | b | 24 | 1 | 1 |
5 | 73 | b | 41 | 2 | 1 |
5 | 73 | b | 23 | 3 | 1 |
2 | 71 | b | 13 | 1 | 2 |
2 | 71 | b | 82 | 2 | 2 |
2 | 71 | b | 53 | 3 | 2 |
6 | 20 | b | 6 | 1 | 2 |
6 | 20 | b | 37 | 2 | 2 |
6 | 20 | b | 83 | 3 | 2 |
1 | 44 | a | 72 | 1 | 3 |
1 | 44 | a | 61 | 2 | 3 |
1 | 44 | a | 92 | 3 | 3 |
4 | 32 | a | 14 | 1 | 3 |
4 | 32 | a | 73 | 2 | 3 |
4 | 32 | a | 31 | 3 | 3 |
Let’s see how participants were distributed in the groups.
df_folded %>%
  count(.folds, participant) %>%
  kable(align='c')
.folds | participant | n |
---|---|---|
1 | 3 | 3 |
1 | 5 | 3 |
2 | 2 | 3 |
2 | 6 | 3 |
3 | 1 | 3 |
3 | 4 | 3 |
fold() with balancing and with id_col
fold() first divides up the data frame by cat_col and then creates n folds within each diagnosis. As there are only 3 participants per diagnosis, we can create at most 3 folds in this scenario.
df_folded <- fold(df, 3, cat_col = 'diagnosis', id_col = 'participant', method = 'n_dist')
# Order by folds
df_folded <- df_folded %>%
  arrange(.folds)
kable(df_folded, align = 'c')
participant | age | diagnosis | score | session | .folds |
---|---|---|---|---|---|
1 | 44 | a | 72 | 1 | 1 |
1 | 44 | a | 61 | 2 | 1 |
1 | 44 | a | 92 | 3 | 1 |
6 | 20 | b | 6 | 1 | 1 |
6 | 20 | b | 37 | 2 | 1 |
6 | 20 | b | 83 | 3 | 1 |
3 | 40 | a | 25 | 1 | 2 |
3 | 40 | a | 100 | 2 | 2 |
3 | 40 | a | 57 | 3 | 2 |
5 | 73 | b | 24 | 1 | 2 |
5 | 73 | b | 41 | 2 | 2 |
5 | 73 | b | 23 | 3 | 2 |
4 | 32 | a | 14 | 1 | 3 |
4 | 32 | a | 73 | 2 | 3 |
4 | 32 | a | 31 | 3 | 3 |
2 | 71 | b | 13 | 1 | 3 |
2 | 71 | b | 82 | 2 | 3 |
2 | 71 | b | 53 | 3 | 3 |
Let’s count how many of each diagnosis there are in each group and find which participants are in which groups.
df_folded %>%
  count(.folds, diagnosis, participant) %>%
  kable(align = 'c')
.folds | diagnosis | participant | n |
---|---|---|---|
1 | a | 1 | 3 |
1 | b | 6 | 3 |
2 | a | 3 | 3 |
2 | b | 5 | 3 |
3 | a | 4 | 3 |
3 | b | 2 | 3 |
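The two properties we care about here (no participant is spread over several folds, and every fold contains both diagnoses) can also be checked in a few lines of base R. This is only a minimal sketch: the small data frame is typed in by hand from the count table above, with one row per participant, and is not produced by fold(). The names folded, folds_per_participant and diagnosis_by_fold are made up for the illustration.

```r
# Hand-typed stand-in for the participant-level result shown above
# (an assumption for illustration; not generated by fold()).
folded <- data.frame(
  participant = c("1", "6", "3", "5", "4", "2"),
  diagnosis   = c("a", "b", "a", "b", "a", "b"),
  .folds      = c(1, 1, 2, 2, 3, 3)
)

# Number of distinct folds each participant appears in
folds_per_participant <- tapply(
  folded$.folds, folded$participant,
  function(x) length(unique(x))
)
all(folds_per_participant == 1)

# Number of participants per fold and diagnosis
diagnosis_by_fold <- table(folded$.folds, folded$diagnosis)
diagnosis_by_fold
```

The same two checks should work on a full df_folded, as long as it has the participant, diagnosis and .folds columns.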
data frame
df <- data.frame(
  "participant" = factor(rep(c('1', '2', '3', '4', '5', '6'), 3)),
  "age" = rep(sample(c(1:100), 6), 3),
  "diagnosis" = factor(rep(c('a', 'b', 'a', 'a', 'b', 'b'), 3)),
  "score" = sample(c(1:100), 3 * 6)
)
df <- df %>% arrange(participant)
# Remove index
rownames(df) <- NULL
# Add session info
df$session <- rep(c('1','2', '3'), 6)
kable(df, align = 'c')
participant | age | diagnosis | score | session |
---|---|---|---|---|
1 | 33 | a | 28 | 1 |
1 | 33 | a | 83 | 2 |
1 | 33 | a | 56 | 3 |
2 | 45 | b | 22 | 1 |
2 | 45 | b | 39 | 2 |
2 | 45 | b | 74 | 3 |
3 | 50 | a | 100 | 1 |
3 | 50 | a | 6 | 2 |
3 | 50 | a | 97 | 3 |
4 | 18 | a | 87 | 1 |
4 | 18 | a | 31 | 2 |
4 | 18 | a | 34 | 3 |
5 | 51 | b | 43 | 1 |
5 | 51 | b | 66 | 2 |
5 | 51 | b | 32 | 3 |
6 | 8 | b | 75 | 1 |
6 | 8 | b | 91 | 2 |
6 | 8 | b | 95 | 3 |
partition() without balancing
df_partitioned <- partition(df, 0.3, list_out = FALSE)
# Order by partitions
df_partitioned <- df_partitioned %>%
  arrange(.partitions)
# Partition Sizes
df_partitioned %>%
  count(.partitions) %>%
  kable(align = 'c')
.partitions | n |
---|---|
1 | 5 |
2 | 13 |
kable(df_partitioned, align = 'c')
participant | age | diagnosis | score | session | .partitions |
---|---|---|---|---|---|
2 | 45 | b | 39 | 2 | 1 |
2 | 45 | b | 74 | 3 | 1 |
4 | 18 | a | 87 | 1 | 1 |
6 | 8 | b | 91 | 2 | 1 |
6 | 8 | b | 95 | 3 | 1 |
1 | 33 | a | 28 | 1 | 2 |
1 | 33 | a | 83 | 2 | 2 |
1 | 33 | a | 56 | 3 | 2 |
2 | 45 | b | 22 | 1 | 2 |
3 | 50 | a | 100 | 1 | 2 |
3 | 50 | a | 6 | 2 | 2 |
3 | 50 | a | 97 | 3 | 2 |
4 | 18 | a | 31 | 2 | 2 |
4 | 18 | a | 34 | 3 | 2 |
5 | 51 | b | 43 | 1 | 2 |
5 | 51 | b | 66 | 2 | 2 |
5 | 51 | b | 32 | 3 | 2 |
6 | 8 | b | 75 | 1 | 2 |
partition() with balancing but without id_col
df_partitioned <- partition(df, 0.3, cat_col = 'diagnosis', list_out = FALSE)
# Order by partitions
df_partitioned <- df_partitioned %>%
  arrange(.partitions)
kable(df_partitioned, align = 'c')
participant | age | diagnosis | score | session | .partitions |
---|---|---|---|---|---|
1 | 33 | a | 56 | 3 | 1 |
3 | 50 | a | 6 | 2 | 1 |
2 | 45 | b | 39 | 2 | 1 |
5 | 51 | b | 66 | 2 | 1 |
1 | 33 | a | 28 | 1 | 2 |
1 | 33 | a | 83 | 2 | 2 |
3 | 50 | a | 100 | 1 | 2 |
3 | 50 | a | 97 | 3 | 2 |
4 | 18 | a | 87 | 1 | 2 |
4 | 18 | a | 31 | 2 | 2 |
4 | 18 | a | 34 | 3 | 2 |
2 | 45 | b | 22 | 1 | 2 |
2 | 45 | b | 74 | 3 | 2 |
5 | 51 | b | 43 | 1 | 2 |
5 | 51 | b | 32 | 3 | 2 |
6 | 8 | b | 75 | 1 | 2 |
6 | 8 | b | 91 | 2 | 2 |
6 | 8 | b | 95 | 3 | 2 |
Let’s count how many of each diagnosis there are in each partition.
df_partitioned %>%
  count(.partitions, diagnosis) %>%
  kable(align = 'c')
diagnosis | .partitions | n |
---|---|---|
a | 1 | 2 |
a | 2 | 7 |
b | 1 | 2 |
b | 2 | 7 |
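The partition sizes in the tables above are consistent with the requested proportion being rounded down, first over all 18 rows and then within the 9 rows of each diagnosis. The base R arithmetic below is a sketch of that observation only. Whether partition() rounds exactly this way in every case is an assumption, and the variable names are made up for the illustration.

```r
n_rows <- 18             # rows in df
rows_per_diagnosis <- 9  # rows for each of 'a' and 'b'
p <- 0.3                 # proportion requested for the first partition

# Without cat_col: one split over all rows
size_first <- floor(p * n_rows)
c(first = size_first, second = n_rows - size_first)

# With cat_col = 'diagnosis': the same split within each diagnosis
size_first_cat <- floor(p * rows_per_diagnosis)
c(first = size_first_cat, second = rows_per_diagnosis - size_first_cat)
```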
partition() with id_col but without balancing
df_partitioned <- partition(df, 0.5, id_col = 'participant', list_out = FALSE)
# Order by partitions
df_partitioned <- df_partitioned %>%
  arrange(.partitions)
kable(df_partitioned, align = 'c')
participant | age | diagnosis | score | session | .partitions |
---|---|---|---|---|---|
4 | 18 | a | 87 | 1 | 1 |
4 | 18 | a | 31 | 2 | 1 |
4 | 18 | a | 34 | 3 | 1 |
5 | 51 | b | 43 | 1 | 1 |
5 | 51 | b | 66 | 2 | 1 |
5 | 51 | b | 32 | 3 | 1 |
6 | 8 | b | 75 | 1 | 1 |
6 | 8 | b | 91 | 2 | 1 |
6 | 8 | b | 95 | 3 | 1 |
1 | 33 | a | 28 | 1 | 2 |
1 | 33 | a | 83 | 2 | 2 |
1 | 33 | a | 56 | 3 | 2 |
2 | 45 | b | 22 | 1 | 2 |
2 | 45 | b | 39 | 2 | 2 |
2 | 45 | b | 74 | 3 | 2 |
3 | 50 | a | 100 | 1 | 2 |
3 | 50 | a | 6 | 2 | 2 |
3 | 50 | a | 97 | 3 | 2 |
Let’s see how participants were distributed in the partitions.
df_partitioned %>%
  count(.partitions, participant) %>%
  kable(align = 'c')
.partitions | participant | n |
---|---|---|
1 | 4 | 3 |
1 | 5 | 3 |
1 | 6 | 3 |
2 | 1 | 3 |
2 | 2 | 3 |
2 | 3 | 3 |
partition() with cat_col and with id_col
partition() first divides up the data frame by cat_col and then creates n partitions within each diagnosis.
df_partitioned <- partition(
  data = df,
  p = 0.5,
  cat_col = 'diagnosis',
  id_col = 'participant',
  list_out = FALSE
)
# Order by partitions
df_partitioned <- df_partitioned %>%
  arrange(.partitions)
kable(df_partitioned, align = 'c')
participant | age | diagnosis | score | session | .partitions |
---|---|---|---|---|---|
3 | 50 | a | 100 | 1 | 1 |
3 | 50 | a | 6 | 2 | 1 |
3 | 50 | a | 97 | 3 | 1 |
2 | 45 | b | 22 | 1 | 1 |
2 | 45 | b | 39 | 2 | 1 |
2 | 45 | b | 74 | 3 | 1 |
1 | 33 | a | 28 | 1 | 2 |
1 | 33 | a | 83 | 2 | 2 |
1 | 33 | a | 56 | 3 | 2 |
4 | 18 | a | 87 | 1 | 2 |
4 | 18 | a | 31 | 2 | 2 |
4 | 18 | a | 34 | 3 | 2 |
5 | 51 | b | 43 | 1 | 2 |
5 | 51 | b | 66 | 2 | 2 |
5 | 51 | b | 32 | 3 | 2 |
6 | 8 | b | 75 | 1 | 2 |
6 | 8 | b | 91 | 2 | 2 |
6 | 8 | b | 95 | 3 | 2 |
Let’s count how many of each diagnosis there are in each partition and find which participants are in which partitions.
df_partitioned %>%
  count(.partitions, diagnosis, participant) %>%
  kable(align = 'c')
.partitions | diagnosis | participant | n |
---|---|---|---|
1 | a | 3 | 3 |
1 | b | 2 | 3 |
2 | a | 1 | 3 |
2 | a | 4 | 3 |
2 | b | 5 | 3 |
2 | b | 6 | 3 |
data frame
df <- data.frame(
  "participant" = factor(rep(c('1', '2', '3', '4', '5', '6'), 3)),
  "age" = rep(sample(c(1:100), 6), 3),
  "diagnosis" = factor(rep(c('a', 'b', 'a', 'a', 'b', 'b'), 3)),
  "score" = sample(c(1:100), 3 * 6)
)
df <- df %>%
  arrange(participant)
# Add session info
df$session <- rep(c('1','2', '3'), 6)
# Sample dataset to get imbalances
df <- df %>%
  sample_frac(0.7) %>%
  arrange(participant)
# Remove index
rownames(df) <- NULL
# Counts
df %>%
  count(diagnosis, participant) %>%
  kable(align = 'c')
diagnosis | participant | n |
---|---|---|
a | 1 | 3 |
a | 3 | 3 |
a | 4 | 1 |
b | 2 | 2 |
b | 5 | 3 |
b | 6 | 1 |
df %>%
  count(diagnosis) %>%
  kable(align = 'c')
diagnosis | n |
---|---|
a | 7 |
b | 6 |
kable(df, align = 'c')
participant | age | diagnosis | score | session |
---|---|---|---|---|
1 | 19 | a | 40 | 3 |
1 | 19 | a | 72 | 2 |
1 | 19 | a | 13 | 1 |
2 | 70 | b | 7 | 3 |
2 | 70 | b | 17 | 2 |
3 | 57 | a | 46 | 1 |
3 | 57 | a | 57 | 3 |
3 | 57 | a | 38 | 2 |
4 | 17 | a | 54 | 1 |
5 | 91 | b | 88 | 2 |
5 | 91 | b | 97 | 1 |
5 | 91 | b | 71 | 3 |
6 | 90 | b | 21 | 2 |
balance() for downsampling without id_col
df_balanced <- balance(df, "min", cat_col = "diagnosis") %>%
  arrange(diagnosis, participant)
# Counts
df_balanced %>%
  count(diagnosis) %>%
  kable(align = 'c')
diagnosis | n |
---|---|
a | 6 |
b | 6 |
kable(df_balanced, align = 'c')
participant | age | diagnosis | score | session |
---|---|---|---|---|
1 | 19 | a | 72 | 2 |
1 | 19 | a | 40 | 3 |
1 | 19 | a | 13 | 1 |
3 | 57 | a | 57 | 3 |
3 | 57 | a | 46 | 1 |
3 | 57 | a | 38 | 2 |
2 | 70 | b | 7 | 3 |
2 | 70 | b | 17 | 2 |
5 | 91 | b | 88 | 2 |
5 | 91 | b | 97 | 1 |
5 | 91 | b | 71 | 3 |
6 | 90 | b | 21 | 2 |
balance() for downsampling with id_col and id_method n_rows_c
df_balanced <- balance(df, "min", cat_col = "diagnosis",
                       id_col = "participant", id_method = "n_rows_c") %>%
  arrange(diagnosis, participant)
# Counts
df_balanced %>%
  count(diagnosis) %>%
  kable(align = 'c')
diagnosis | n |
---|---|
a | 6 |
b | 6 |
kable(df_balanced, align = 'c')
participant | age | diagnosis | score | session |
---|---|---|---|---|
1 | 19 | a | 40 | 3 |
1 | 19 | a | 72 | 2 |
1 | 19 | a | 13 | 1 |
3 | 57 | a | 46 | 1 |
3 | 57 | a | 57 | 3 |
3 | 57 | a | 38 | 2 |
2 | 70 | b | 7 | 3 |
2 | 70 | b | 17 | 2 |
5 | 91 | b | 88 | 2 |
5 | 91 | b | 97 | 1 |
5 | 91 | b | 71 | 3 |
6 | 90 | b | 21 | 2 |
Let’s count how many of each participant there are in each diagnosis.
df_balanced %>%
  count(diagnosis, participant) %>%
  kable(align = 'c')
diagnosis | participant | n |
---|---|---|
a | 1 | 3 |
a | 3 | 3 |
b | 2 | 2 |
b | 5 | 3 |
b | 6 | 1 |
balance() for upsampling without id_col
df_balanced <- balance(df, "max", cat_col = "diagnosis") %>%
  arrange(diagnosis, participant)
# Counts
df_balanced %>%
  count(diagnosis) %>%
  kable(align = 'c')
diagnosis | n |
---|---|
a | 7 |
b | 7 |
kable(df_balanced, align = 'c')
participant | age | diagnosis | score | session |
---|---|---|---|---|
1 | 19 | a | 40 | 3 |
1 | 19 | a | 72 | 2 |
1 | 19 | a | 13 | 1 |
3 | 57 | a | 46 | 1 |
3 | 57 | a | 57 | 3 |
3 | 57 | a | 38 | 2 |
4 | 17 | a | 54 | 1 |
2 | 70 | b | 7 | 3 |
2 | 70 | b | 17 | 2 |
5 | 91 | b | 88 | 2 |
5 | 91 | b | 97 | 1 |
5 | 91 | b | 71 | 3 |
6 | 90 | b | 21 | 2 |
6 | 90 | b | 21 | 2 |
balance() for upsampling with id_col and id_method n_rows_c
df_balanced <- balance(df, "max", cat_col = "diagnosis",
                       id_col = "participant", id_method = "n_rows_c") %>%
  arrange(diagnosis, participant)
# Counts
df_balanced %>%
  count(diagnosis) %>%
  kable(align = 'c')
diagnosis | n |
---|---|
a | 7 |
b | 7 |
kable(df_balanced, align = 'c')
participant | age | diagnosis | score | session |
---|---|---|---|---|
1 | 19 | a | 40 | 3 |
1 | 19 | a | 72 | 2 |
1 | 19 | a | 13 | 1 |
3 | 57 | a | 46 | 1 |
3 | 57 | a | 57 | 3 |
3 | 57 | a | 38 | 2 |
4 | 17 | a | 54 | 1 |
2 | 70 | b | 7 | 3 |
2 | 70 | b | 17 | 2 |
5 | 91 | b | 88 | 2 |
5 | 91 | b | 97 | 1 |
5 | 91 | b | 71 | 3 |
6 | 90 | b | 21 | 2 |
6 | 90 | b | 21 | 2 |
Let’s count how many of each participant there are in each diagnosis.
df_balanced %>%
  count(diagnosis, participant) %>%
  kable(align = 'c')
diagnosis | participant | n |
---|---|---|
a | 1 | 3 |
a | 3 | 3 |
a | 4 | 1 |
b | 2 | 2 |
b | 5 | 3 |
b | 6 | 2 |
balance() for downsampling to a specific size without id_col
df_balanced <- balance(df, 3, cat_col = "diagnosis") %>%
  arrange(diagnosis, participant)
# Counts
df_balanced %>%
  count(diagnosis) %>%
  kable(align = 'c')
diagnosis | n |
---|---|
a | 3 |
b | 3 |
kable(df_balanced, align = 'c')
participant | age | diagnosis | score | session |
---|---|---|---|---|
1 | 19 | a | 13 | 1 |
3 | 57 | a | 38 | 2 |
4 | 17 | a | 54 | 1 |
2 | 70 | b | 7 | 3 |
5 | 91 | b | 71 | 3 |
6 | 90 | b | 21 | 2 |
Let’s count how many of each participant there are in each diagnosis.
df_balanced %>%
  count(diagnosis, participant) %>%
  kable(align = 'c')
diagnosis | participant | n |
---|---|---|
a | 1 | 1 |
a | 3 | 1 |
a | 4 | 1 |
b | 2 | 1 |
b | 5 | 1 |
b | 6 | 1 |
balance() for downsampling to a specific size with id_col and id_method n_rows_c
df_balanced <- balance(df, 3, cat_col = "diagnosis",
                       id_col = "participant", id_method = "n_rows_c") %>%
  arrange(diagnosis, participant)
# Counts
df_balanced %>%
  count(diagnosis) %>%
  kable(align = 'c')
diagnosis | n |
---|---|
a | 3 |
b | 3 |
kable(df_balanced, align = 'c')
participant | age | diagnosis | score | session |
---|---|---|---|---|
3 | 57 | a | 46 | 1 |
3 | 57 | a | 57 | 3 |
3 | 57 | a | 38 | 2 |
2 | 70 | b | 7 | 3 |
2 | 70 | b | 17 | 2 |
6 | 90 | b | 21 | 2 |
Let’s count how many of each participant there are in each diagnosis.
df_balanced %>%
  count(diagnosis, participant) %>%
  kable(align = 'c')
diagnosis | participant | n |
---|---|---|
a | 3 | 3 |
b | 2 | 2 |
b | 6 | 1 |
data frame
df <- data.frame(
  "x" = c(1:12),
  "species" = factor(rep(c('cat', 'pig', 'human'), 4)),
  "age" = sample(c(1:100), 12)
)
group_factor() with randomize set to TRUE
groups <- group_factor(df, 5, method = 'n_dist', randomize = TRUE)
groups
#> [1] 3 2 1 5 4 5 1 3 2 3 5 4
#> Levels: 1 2 3 4 5
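To see what changes and what stays the same when the groups are randomized, we can compare an ordered grouping factor with a shuffled copy. This is a base R sketch that assumes randomizing amounts to shuffling the group labels; it is not taken from the package source. The group sizes (2, 2, 3, 2, 3) are counted from the output above, and ordered_groups and random_groups are names made up for the illustration.

```r
set.seed(1)  # only to make the shuffle reproducible

# Ordered grouping factor: 12 elements in 5 groups,
# with the same group sizes as in the output above
ordered_groups <- factor(c(1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5))

# Shuffle the labels (assumed stand-in for randomize = TRUE)
random_groups <- sample(ordered_groups)

# The group sizes are unchanged; only the positions differ
table(ordered_groups)
table(random_groups)
```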
splt() with randomize set to TRUE
df_list <- splt(df, 5, method = 'n_dist', randomize = TRUE)
df_list %>% kable(align = 'c')
In this section, we will take a look at the outputs we get from the different methods.
Below you’ll see a data frame with counts of group elements when dividing up the same data with the different n_* methods. The forced_equal column shows the result when force_equal is set to TRUE.
forced_equal: This setting ensures that all groups have the same size, so every group contains the same number of elements.
n_dist: Compared to forced_equal, we see the 3 datapoints that forced_equal discarded. These are distributed across the groups (in this example, groups 2, 4 and 6).
n_fill: The 3 extra datapoints are placed in the first 3 groups. Had we set descending = TRUE, they would have been placed in the last 3 groups instead.
n_last: We see that n_last creates equal group sizes in all but the last group. This means that the last group can sometimes be very large or very small compared to the other groups. Here it is a third larger than the other groups.
n_rand: The extra datapoints are placed randomly, so we would see them in different groups if we ran the script again, unless we call set.seed() before running the function.
#> x n_dist n_fill n_last n_rand forced_equal
#> 1 1 9 10 9 9 9
#> 2 2 10 10 9 10 9
#> 3 3 9 10 9 10 9
#> 4 4 10 9 9 9 9
#> 5 5 9 9 9 9 9
#> 6 6 10 9 12 10 9
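Three of the columns above can be reproduced with plain arithmetic. Each method column sums to 57, so the sketch below assumes 57 elements split into 6 groups. It is written in base R to mirror the counts in the table, not copied from the package, so treat the formulas as assumptions about how the sizes come about. The names dist_sizes, fill_sizes and equal_sizes are made up for the illustration.

```r
n_elements <- 57
n_groups <- 6

# n_dist-like: spread the extra elements across the groups
dist_groups <- ceiling(seq_len(n_elements) / (n_elements / n_groups))
dist_sizes <- as.vector(table(dist_groups))

# n_fill-like: give the extra elements to the first groups
base_size <- n_elements %/% n_groups  # whole elements per group
n_extra <- n_elements %% n_groups     # elements left over
fill_sizes <- base_size + (seq_len(n_groups) <= n_extra)

# forced_equal-like: drop the extra elements
equal_sizes <- rep(base_size, n_groups)

data.frame(group = seq_len(n_groups), dist_sizes, fill_sizes, equal_sizes)
```

The n_last and n_rand columns are left out: the text above does not pin down how n_last picks its base size, and n_rand differs from run to run.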
Here is another example.
#> x n_dist n_fill n_last n_rand forced_equal
#> 1 1 10 11 11 11 10
#> 2 2 11 11 11 11 10
#> 3 3 10 11 11 11 10
#> 4 4 11 11 11 11 10
#> 5 5 11 11 11 10 10
#> 6 6 10 11 11 11 10
#> 7 7 11 11 11 10 10
#> 8 8 11 10 11 11 10
#> 9 9 10 10 11 10 10
#> 10 10 11 10 11 11 10
#> 11 11 11 10 7 10 10
Below you will see group sizes when using the method greedy and asking for group sizes of 8, 15, and 20. What should become clear is that only the last group can have a different size than what we asked for. This is important if, say, you want to split a time series into groups of 100 elements, but the length of the time series is not divisible by 100. If you need equal groups, you could then use force_equal to remove the excess elements.
With a size of 8, we get 13 groups. The last group (13) only contains 4 elements, but all the other groups contain 8 elements as specified.
With a size of 15, we get 7 groups. The last group (7) contains only 10 elements, but all the other groups contain 15 elements as specified.
With a size of 20, we get 5 groups. As the 100 elements in the split vector are divisible by 20, the last group also contains 20 elements, and so we have equal groups.
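The three cases above come down to integer division of a 100-element vector. As a minimal base R sketch (greedy_sizes is a hypothetical helper written for this illustration, not a groupdata2 function), we can fill groups of the requested size in order and count what ends up in each:

```r
# Hypothetical helper: group sizes when filling groups of `size`
# elements in order, with the remainder going to the last group
greedy_sizes <- function(n_elements, size) {
  groups <- ceiling(seq_len(n_elements) / size)
  as.vector(table(groups))
}

sizes_8 <- greedy_sizes(100, 8)
sizes_15 <- greedy_sizes(100, 15)
sizes_20 <- greedy_sizes(100, 20)

# Number of groups and size of the last group for each setting
sapply(
  list(size_8 = sizes_8, size_15 = sizes_15, size_20 = sizes_20),
  function(s) c(n_groups = length(s), last_group = s[length(s)])
)
```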
Below you’ll see a plot with the group sizes at each group when using step sizes 2, 5, and 11.
At a step size of 2 elements, the group size simply increases by 2 for each group until the last group (32), where it runs out of elements. Had we set force_equal = TRUE, this last group would have been discarded because of the lack of elements.
At a step size of 5 elements, the group size increases by 5 every time, so it runs out of elements faster. Again we see that the last group (20) has fewer elements.
At a step size of 11 elements, the group size increases by 11 every time. It seems that the last group is not too small, but it can be hard to see on this scale. Actually, the last group is 1 element short of being complete and so would have been discarded if force_equal was set to TRUE.
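The numbers in these three cases can be checked with a small loop. The sketch below is plain base R, not the package's implementation: step_sizes is a hypothetical helper that assumes group i asks for step_size * i elements and that the last group takes whatever is left of the 1000-element split vector.

```r
# Hypothetical helper: group sizes when group i asks for
# step_size * i elements and the last group takes the remainder
step_sizes <- function(n_elements, step_size) {
  sizes <- integer(0)
  remaining <- n_elements
  group <- 1
  while (remaining > 0) {
    taken <- min(step_size * group, remaining)
    sizes <- c(sizes, taken)
    remaining <- remaining - taken
    group <- group + 1
  }
  sizes
}

sizes_2 <- step_sizes(1000, 2)
sizes_5 <- step_sizes(1000, 5)
sizes_11 <- step_sizes(1000, 11)

# Number of groups, and how many elements the last group is
# short of the size it asked for
shortfall <- function(sizes, step_size) {
  length(sizes) * step_size - sizes[length(sizes)]
}
c(groups = length(sizes_11), missing = shortfall(sizes_11, 11))

# The cumulative sum always ends at the number of elements
tail(cumsum(sizes_2), 1)
```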
Below we will take a quick look at the cumulative sum of group
elements to get an idea of what is going on under the hood.
Remember that the split vector had 1000 elements? That is why they all
stop at 1000 on the y-axis. There are simply no more elements left!
Below you’ll see a plot with the group sizes at each group when starting from prime numbers 2, 5, and 11.
Below we will take a quick look at the cumulative sum of group
elements to get an idea of what is going on under the hood.
Because the split vector had 1000 elements, it stops at 1000 on the
y-axis. There are simply no more elements left!
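For readers without the plots at hand, here is a rough base R sketch of how such group sizes could be generated. It assumes that the group sizes follow successive prime numbers from the chosen starting prime, and that the last group takes whatever remains of the 1000 elements. Both is_prime and prime_sizes are hypothetical helpers written for this illustration; they are not groupdata2 functions.

```r
# Hypothetical helper: trial-division primality test
is_prime <- function(x) {
  if (x < 2) return(FALSE)
  if (x < 4) return(TRUE)  # 2 and 3
  all(x %% 2:floor(sqrt(x)) != 0)
}

# Hypothetical helper: group sizes following successive primes,
# with the last group taking the remaining elements (assumption)
prime_sizes <- function(n_elements, start_at) {
  sizes <- integer(0)
  remaining <- n_elements
  candidate <- start_at
  while (remaining > 0) {
    if (is_prime(candidate)) {
      taken <- min(candidate, remaining)
      sizes <- c(sizes, taken)
      remaining <- remaining - taken
    }
    candidate <- candidate + 1
  }
  sizes
}

from_2 <- prime_sizes(1000, 2)
from_5 <- prime_sizes(1000, 5)
from_11 <- prime_sizes(1000, 11)

# First group sizes and the end point of the cumulative sum
head(from_2)
tail(cumsum(from_2), 1)
```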
You have reached the end! Now celebrate by taking the week off, splitting data and laughing!