After briefly describing the problem that fplyr tries to solve, this vignette goes through all the functions in the package and explains their usage. To make the most of this package, some familiarity with the data.table package is recommended: when an option is unclear, detailed help can often be found in the manual of data.table’s fread() function. A basic acquaintance with the *ply family of functions in R, especially lapply(), will also be helpful. You are encouraged to run the code of this vignette on your own and explore the output of the commands.
A very common operation when analyzing data is splitting the observations into groups and applying a function to each group separately. This operation is so common that R has at least two functions that implement it: by() and aggregate(). However, these functions require the data to be loaded into RAM, and files are often too big to fit in memory. fplyr was born to solve this problem: it performs split-apply-combine operations on very big files by reading them chunk by chunk, so that only a limited number of rows is stored in memory at any given time.
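For readers who have not used the in-memory functions mentioned above, the following sketch shows split-apply-combine with by() and aggregate() on the built-in iris data set. It uses base R only; the variable names are chosen here for illustration.

```r
# Split the built-in iris data by species and apply a function to each group.
# Both calls need the whole data set in memory.

# by(): one result per group (here, the column means of each species)
means_by <- by(iris[, 1:4], iris$Species, colMeans)

# aggregate(): a single data.frame with one row per group
means_agg <- aggregate(. ~ Species, data = iris, FUN = mean)

means_by[["setosa"]]
means_agg
```

Both calls work on a data set that has already been read in full, which is exactly what becomes impossible when the file exceeds the available RAM.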
fplyr combines the strengths of two other packages: iotools and data.table. While iotools has some functions, such as chunk.apply(), to apply a function to chunks of files, the chunks may not reflect the actual groups into which the data are partitioned. In particular, a ‘chunk’ may contain observations pertaining to several different groups, and the task of further splitting them is left to the user. In fplyr, on the other hand, the further splitting is done automatically (thanks also to the data.table package), so the user need not worry about it.
Before using fplyr you need to ensure that the input file is in the correct format. First and foremost, the data must be amenable to the split-apply-combine paradigm, so the observations must be grouped according to the value of a certain field. We refer to the values of the ‘groupby’ field as the subjects. Thus, for instance, in the famous iris data set, each species would be a different subject. All the observations pertaining to the same subject constitute a block.
In fplyr the input file must be formatted so that the first field contains the subject IDs; if the IDs are in any other field, it won’t work. Moreover, all the observations referring to the same subject must be consecutive; in other words, the file must be sorted on the first field, because the file is read block by block. Indeed, the subject ID of each line is compared with that of the previous line, and the reading goes on as long as the IDs are the same.[1] Note that fplyr ensures that all the rows with the same subject ID are read together in the same batch only if those rows are consecutive. To make sure that a file complies with these specifications, you can use *nix command-line tools such as awk and sort.
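The requirement that rows with the same subject ID be consecutive can also be checked from within R. The sketch below is a hypothetical helper, not part of fplyr: it collapses the ID column into runs with rle() and reports whether any ID starts more than one run.

```r
# Hypothetical helper: TRUE if every subject ID forms one consecutive run
ids_are_consecutive <- function(ids) {
  runs <- rle(as.character(ids))       # collapse consecutive repeats into runs
  anyDuplicated(runs$values) == 0      # an ID that starts two runs is split
}

ids_are_consecutive(iris$Species)      # species are stored in contiguous blocks
ids_are_consecutive(c("a", "b", "a"))  # "a" appears in two separate runs
```

For a real file, the ID column alone could be read with, for example, data.table::fread(path, select = 1L)[[1]], assuming that this single column fits in memory.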
As an example file, in this vignette we will use a modified version of the iris dataset where the species has been relocated to the first column. This file is very small and would fit in the RAM of even old hardware, so fplyr would not be necessary. Nevertheless, this file is attached to the package, so it is immediately available to all users, and although it has only three blocks, it still illustrates the most important features of fplyr. We begin by storing the path to this file in the variable f:
library(fplyr) # Loading fplyr also loads data.table, which provides fread()
f <- system.file("extdata", "dt_iris.csv", package = "fplyr")
# Let's have a look at the first four lines of the file
fread(f, nrows = 4)
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1: setosa 5.1 3.5 1.4 0.2
#> 2: setosa 4.9 3.0 1.4 0.2
#> 3: setosa 4.7 3.2 1.3 0.2
#> 4: setosa 4.6 3.1 1.5 0.2
Use flply() when you want to obtain a list where each element corresponds to a subject and contains the result of the processing of the corresponding block. In our examples, the output of flply() will contain three elements, one for each Iris species. The elements of the list will be conveniently named after the subject IDs.
fplyr allows you to apply a function to each block of the file. To distinguish the user-specified function applied to each block from other functions, we shall refer to it as FUN. In the first example we will obtain the summary() of each species. In general, all the functions in the package support two fundamental arguments: the path to the input file, and FUN.
species_summ <- flply(input = f, FUN = summary)
# Now `species_summ` is a list of three elements; let's show the 'versicolor' element
species_summ$versicolor
#> Species Sepal.Length Sepal.Width Petal.Length
#> Length:50 Min. :4.900 Min. :2.000 Min. :3.00
#> Class :character 1st Qu.:5.600 1st Qu.:2.525 1st Qu.:4.00
#> Mode :character Median :5.900 Median :2.800 Median :4.35
#> Mean :5.936 Mean :2.770 Mean :4.26
#> 3rd Qu.:6.300 3rd Qu.:3.000 3rd Qu.:4.60
#> Max. :7.000 Max. :3.400 Max. :5.10
#> Petal.Width
#> Min. :1.000
#> 1st Qu.:1.200
#> Median :1.300
#> Mean :1.326
#> 3rd Qu.:1.500
#> Max. :1.800
For flply(), FUN can be any function that takes a “data.frame” as input; summary() was just an example, and other appropriate functions are str(), as.matrix(), and so on. Of course, if you cannot find a function that does what you want, you can write your own FUN, as we shall see in the next example, where we’ll perform hierarchical clustering within each species.[2] Note that this is also how functions like lapply() work.
clusters <- flply(f, FUN = function(d) {
    dm <- dist(d[, -1]) # Compute the distance matrix, excluding the first field
    hclust(dm) # Perform the clustering and return the object
})
# The `clusters` variable contains one "hclust" object for each species.
# Let's plot the 'setosa' dendrogram
plot(clusters$setosa)
If FUN takes more than one argument, any additional arguments can be passed directly to flply(), which will pass them, in turn, to FUN. For instance, suppose that we want to use kmeans() instead of hclust(), and we want to specify the number of centroids as an additional parameter. In the next example we will also define FUN as a separate function before using it, rather than writing an anonymous function as in the previous example. The output will be a “kmeans” object for each species.
kmeans_FUN <- function(d, my_centers) {
    kmeans(d[, -1], centers = my_centers)
}
my_centers <- 2
# We pass `my_centers` to flply(), and flply() passes it to kmeans_FUN
clusters <- flply(f, FUN = kmeans_FUN, my_centers)
# Let's display the centers of the 'setosa' clusters
clusters$setosa$centers
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 4.704545 3.122727 1.413636 0.2000000
#> 2 5.242857 3.667857 1.500000 0.2821429
# Now let's do the same thing, but with three centers for each species
my_centers <- 3
clusters <- flply(f, FUN = kmeans_FUN, my_centers)
clusters$setosa$centers
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 4.678947 3.084211 1.378947 0.200000
#> 2 5.512500 4.000000 1.475000 0.275000
#> 3 5.100000 3.513043 1.526087 0.273913
The last example of this section may be a bit surprising. Since, in R, [[ is a function, nothing prevents us from using it as FUN to select only, say, the second column of each block. Admittedly, however, in this case it would be better to use the select option (see ?flply and ?fread, or wait for the Other options subsection).
sepal_length <- flply(f, `[[`, 2)
# Now `sepal_length` contains all the sepal lengths, divided by species
sepal_length
#> $setosa
#> [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7
#> [20] 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0 5.5 4.9
#> [39] 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0
#>
#> $versicolor
#> [1] 7.0 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2
#> [20] 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3
#> [39] 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7
#>
#> $virginica
#> [1] 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
#> [20] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4
#> [39] 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
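The same idea can be tried in base R without any file. The sketch below calls [[ as an ordinary function and then hands it to lapply() with an extra argument, mirroring the way flply() forwards additional arguments to FUN. Note that in the built-in iris data the sepal length is column 1, whereas in the example file it is column 2 because the species comes first.

```r
# `[[` called as an ordinary function: the two lines below are equivalent
iris[[1]]
`[[`(iris, 1)

# Because it is a function, it can be given to lapply() with an extra
# argument, which lapply() passes on to `[[`
blocks <- split(iris, iris$Species)
sepal_length_base <- lapply(blocks, `[[`, 1)
```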
We followed the same convention as the plyr package. The name of each function consists of two letters followed by ‘ply’: the first letter represents the type of input, the second letter characterizes the type of output, and the final ‘ply’ marks the relation with the existing ‘apply’ family of functions. The first letter is usually ‘f’, because the input is the path to a file. The second letter is ‘l’ if the output is a list, as in flply(), ‘t’ if the output is a “data.table”, ‘f’ if the output is another file, and ‘m’ if the output can be multiple things.
Use ftply() to return a “data.table”; the rows corresponding to the different subjects will be combined with rbind. Needless to say, in this case FUN must return a “data.frame” or a “data.table”, while in flply() there was no such restriction. (When fplyr is loaded, the data.table package is loaded as well.) Moreover, in this case FUN has to take at least two arguments: the first is a “data.table” corresponding to the current block being processed, and the second is a character vector containing the subject ID. This is best explained with an example:
selected_flowers <- ftply(f, function(d, by) {
    if (by == "setosa")
        return(NULL)
    else
        return(d)
})
#> Warning in ftply(f, function(d, by) {: Block 1 returned an empty data.table.
# Let's have a look at the first few lines of the output; note that it starts directly with 'versicolor', because all the 'setosa' flowers have been omitted
head(selected_flowers, 4)
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1: versicolor 7.0 3.2 4.7 1.4
#> 2: versicolor 6.4 3.2 4.5 1.5
#> 3: versicolor 6.9 3.1 4.9 1.5
#> 4: versicolor 5.5 2.3 4.0 1.3
Here, we are skipping the ‘setosa’ species. The result is equal to the input, except that the rows corresponding to the setosa flowers are omitted. Notice also that fplyr warns us that one block didn’t return any output. In general, the behavior of ftply() is equivalent to flply() followed by rbind on the resulting list.
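The relation between the list output and the table output can be illustrated with a base-R analogue that needs neither fplyr nor a file. Here split() and lapply() stand in for flply(), and do.call(rbind, ...) performs the combining step; the function keep_block() is a name made up for this sketch.

```r
# Base-R analogue, using split() + lapply() in place of flply()
blocks <- split(iris, iris$Species)

keep_block <- function(d) {
  # Drop the 'setosa' block, keep the others unchanged
  if (d$Species[1] == "setosa") NULL else d
}

as_list <- lapply(blocks, keep_block)        # one element per subject
as_list <- Filter(Negate(is.null), as_list)  # discard blocks that returned NULL
as_table <- do.call(rbind, as_list)          # bind the remaining blocks by row
```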
Importantly, the d argument to FUN contains a “data.table” of the current block being processed, but without the first field. This is just for efficiency; the first field will be added back to the output of FUN. In fact, the following example shows that inside FUN the d data set has only four columns, whereas the full data set has five.
count_cols <- function(d, by) {
    ncol(d)
}
ftply(f, count_cols)
#> Species V1
#> 1: setosa 4
#> 2: versicolor 4
#> 3: virginica 4
nblocks option
ftply() can also be used to quickly glance at the data, much like one would use the head() function. Indeed, we can specify the nblocks option to select only the first block; thus, we can see what the data look like without loading the whole file into memory. By default, in ftply() FUN returns the data without modifying them, so in this case we can avoid specifying FUN. Incidentally, all the other functions support the nblocks option as well; it is intended to be the analogue of nrows in read.table() and fread().
flowers_head <- ftply(f, nblocks = 1)
# Now `flowers_head` has 50 observations, while the original data set had 150. Let's have a look at the first ones.
head(flowers_head, 4)
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1: setosa 5.1 3.5 1.4 0.2
#> 2: setosa 4.9 3.0 1.4 0.2
#> 3: setosa 4.7 3.2 1.3 0.2
#> 4: setosa 4.6 3.1 1.5 0.2
Another useful option is parallel, which specifies the number of threads that fplyr can use. Like nblocks, the parallel option is supported by all the functions. It is not necessary to initialize any cluster, but this option has an effect only on Unix-like systems, not on Windows. In the following example we will select, for each block, a random sample of 10 observations.
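A minimal sketch of such a call is given here. The helper name sample_rows() and its n parameter are assumptions made for illustration, the parallel option is passed by name, and the fplyr call is guarded so that the snippet also runs where the package is not installed or the system is not Unix-like.

```r
# FUN for ftply(): `d` is the current block, `by` is the subject ID.
# `n` is a hypothetical extra parameter, not part of fplyr.
sample_rows <- function(d, by, n = 10) {
  d[sample(nrow(d), min(n, nrow(d))), ]
}

# Stand-alone check on one block of the built-in iris data (no fplyr needed)
block <- iris[iris$Species == "setosa", -5]
picked <- sample_rows(block, "setosa")

# With fplyr installed on a Unix-like system, run the same FUN on two threads
if (requireNamespace("fplyr", quietly = TRUE) && .Platform$OS.type == "unix") {
  library(fplyr)
  f <- system.file("extdata", "dt_iris.csv", package = "fplyr")
  random_sample <- ftply(f, sample_rows, parallel = 2)
}
```

Because the sampling is random, the selected rows differ from run to run unless a seed is set beforehand.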
This package was born to deal with files that are too big to fit into the available RAM. With fplyr, it is easy to process such files, but what if even the output of the processing is too big for memory? One solution is to write the output to a file as soon as it is generated, without ever returning it. This solution is implemented in the ffply() function, but it works only if FUN returns a “data.table” or “data.frame”. It is equivalent to calling ftply() followed by write.table() or fwrite(). This function supports one additional argument with respect to the previously described functions: the path to the output file. In the example, we will replace the original observations with their principal components, block by block.
out <- tempfile() # Create temporary output file
ffply(f, out, function(d, by) {
    # Here, `d` does not contain the subject IDs; they will be automatically added back later
    x <- prcomp(d)$x
    as.data.table(x)
})
# Let's check the result. Note in particular that the subject IDs are present
fread(out, nrows = 4)
#> Species PC1 PC2 PC3 PC4
#> 1: setosa -0.1068424 -0.02489398 0.08216974 -0.034541755
#> 2: setosa 0.3940472 0.16586593 0.13148092 -0.017551195
#> 3: setosa 0.3906877 -0.12685112 0.07181182 0.009744303
#> 4: setosa 0.5117016 -0.02656106 -0.11121361 -0.032673214
For ffply(), FUN must take two arguments, as in ftply(). The return value of ffply() is the number of processed blocks.
Besides the options we have already discussed, such as nblocks and parallel, all the functions in the package support a set of core options that modify how the file is read. These options are as follows.
key.sep: The character that delimits the first field from the rest [default: “\t”].
sep: The field delimiter (often equal to key.sep) [default: “\t”].
skip: Number of lines to skip at the beginning of the file [default: 0].
header: Whether the file has a header [default: TRUE].
nblocks: The number of blocks to read [default: Inf].
stringsAsFactors: Whether to convert strings into factors [default: FALSE].
colClasses: Vector or list specifying the class of each field [default: NULL].
select: The columns (names or numbers) to be read [default: NULL].
drop: The columns (names or numbers) not to be read [default: NULL].
col.names: Names of the columns [default: NULL].
With the exception of key.sep, all these options are comprehensively documented in the help page of data.table’s fread() function (?fread). For key.sep, see the help page of iotools’ read.chunk() (?read.chunk).
For the last function, suppose that the analysis of each block produces several outputs; for instance, we may want to compute the principal components as well as a nonlinear transformation of the variables, for each block, and save them to two separate output files. In this case, we can use fmply(). Like ffply(), it supports the output option, but this time it can be a vector of several paths. Accordingly, FUN should now return a list of “data.table”s, one for each of the output files.
out <- c(pca = tempfile(), transf = tempfile())
# Note that the vector need not be named; we use these names just for convenience
analyze_block <- function(d) {
    # Here, `d` does contain the subject IDs, so we have to remove them...
    x <- prcomp(d[, -1])$x
    # ...and add them back manually
    x <- cbind(d[, 1], x)
    # Transform each number 'z' into e^(-z)
    y <- cbind(d[, 1], exp(-d[, -1]))
    # Return a list of two "data.table"s
    list(x, y)
}
fmply(f, out, analyze_block)
Notice that, contrary to ffply(), FUN takes only one argument, and it is the full block, including the first field. Therefore, we had to remove this field when we computed the principal components, and add it back at the end. (In ffply() and ftply() this is done automatically.) Moreover, FUN should now return two values, the first of which is printed to the first output file, and the second of which is printed to the second output file. There is no limit to the number of output files, but the order of the output files and of the values returned by FUN must match (named vectors and lists are not taken into account at the moment).
Sometimes it is also necessary to return objects that are not printable as “data.table”s. For instance, suppose that, besides printing the principal components to the output file, we also wanted to return the "prcomp" object. In these cases, fmply() is still helpful, because it allows FUN to return one more element, which in turn will be returned by fmply(). For example, consider the following modification of analyze_block():
analyze_block2 <- function(d) {
    pca <- prcomp(d[, -1])
    x <- cbind(d[, 1], pca$x)
    y <- cbind(d[, 1], exp(-d[, -1]))
    # 'x' and 'y' are the same as before, but now we add the 'pca' object
    list(x, y, pca)
}
iris_pca <- fmply(f, out, analyze_block2)
# Let's have a look at the screeplot of the 'versicolor' PCA
screeplot(iris_pca$versicolor)
Here, FUN returns three values, but there are only two output files. The third value returned by FUN, then, is returned at the end by fmply(). In particular, the variable iris_pca will be a list of three "prcomp" objects, one for each species.
[1] Actually, it is a bit more complicated than that: the iotools package takes care of the reading, so the file is read chunk by chunk, not block by block, and each chunk is then split into its constituent blocks; you can read more about how iotools reads files in the help page of the chunk.reader() function.
[2] Yes, I know this clustering is pointless, but the example is just meant to illustrate the kind of things that one can do, provided that one has access to more appropriate data sets.