As an open-source language, R has many competing methods of data processing. Selecting the preferred approach can depend on the goals of the work, the background of the practitioner, and the comparative advantages of the methods. A beginning level user of R may have to simultaneously learn about the structure of the data, new methods of analysis, and the computational tools for implementation. Coding syntax has a learning curve. The DTwrappers package can help to simplify the process of building many complex queries and data processing steps.
The DTwrappers package was developed to produce wrapper functions for the data.table package. Its capabilities encompass a broad range of data processing methods. This package aims to achieve several goals:
Create a simplified syntax for data processing that minimizes the need for complex coding statements, all with the goal of helping beginning level R programmers learn to work with the language and to analyze data.
Provide access to a wide range of data processing methods.
Utilize data.table's methods in calculations to provide competitive speed and memory efficiency in analysis.
Teach users the syntax of data.table's methods by providing translated coding statements for the calculations performed.
While DTwrappers is intended for beginning level analysts, its methods may also be of use to experienced programmers from other languages who are new to working with R. Furthermore, the package provides translations of all of its methods into data.table's syntax, which can allow users who are learning these methods to see many working examples of coding statements.
This vignette will introduce the methods of DTwrappers and provide working examples in the context of data analysis. We will primarily work with the iris data in R. The following code is used to create a new object dat that has randomly reordered the rows of the iris data.
library(DTwrappers)
data(iris)
n <- nrow(iris)
dt.name <- "dat"
RNGversion(vstr = 3.6)
set.seed(seed = 921)
# Randomizes the order of the rows.
dat <- data.table::as.data.table(x = iris[sample(x = 1:n, size = n, replace = F),
])
head(dat)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1: 5.1 3.5 1.4 0.2 setosa
#> 2: 5.2 2.7 3.9 1.4 versicolor
#> 3: 6.4 2.8 5.6 2.2 virginica
#> 4: 4.8 3.0 1.4 0.3 setosa
#> 5: 7.2 3.2 6.0 1.8 virginica
#> 6: 6.3 3.3 4.7 1.6 versicolor
The data records the measurements in centimeters of the lengths and widths of the sepals and petals of sampled irises. The species are setosa, versicolor, and virginica. The sample includes a total of 150 observations, which are split evenly among the species.
The DTwrappers package offers the following methods:
These functions can be called quite simply, but they include many features that can facilitate more complex queries. For instance, dt.calculate can be used for simple calculations of a mean or standard deviation, but these measures can be applied on specified subsets of the data, performed separately in subgroups, and scaled to simultaneous calculations on a large number of variables.
The following section will provide examples of these methods.
Reducing the data by extracting a subset of relevant rows is a fundamental step in many analyses. The dt.choose.rows method is built as a wrapper function of data.table's filtering method (the i step). A series of logical tests on variables within the data may be specified. Options also exist to return a data.table coding statement (result = "code") for educational purposes or both the result and the code together (result = "all").
We can extract the first 3 rows by providing the name of the data (dt.name) and a filter (1:3 as a numeric variable).
dt.choose.rows(dt.name = "dat", the.filter = 1:3)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1: 5.1 3.5 1.4 0.2 setosa
#> 2: 5.2 2.7 3.9 1.4 versicolor
#> 3: 6.4 2.8 5.6 2.2 virginica
We can also set the value of the.filter using a character version "1:3":
dt.choose.rows(dt.name = "dat", the.filter = "1:3")
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1: 5.1 3.5 1.4 0.2 setosa
#> 2: 5.2 2.7 3.9 1.4 versicolor
#> 3: 6.4 2.8 5.6 2.2 virginica
We can then display the code by setting return.as to "code":
dt.choose.rows(dt.name = "dat", the.filter = "1:3", return.as = "code")
#> [1] "dat[1:3,]"
Setting return.as = "all" will return a list (object) with elements for the result and the code:
dt.choose.rows(dt.name = "dat", the.filter = "1:3", return.as = "all")
#> $result
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1: 5.1 3.5 1.4 0.2 setosa
#> 2: 5.2 2.7 3.9 1.4 versicolor
#> 3: 6.4 2.8 5.6 2.2 virginica
#>
#> $code
#> [1] "dat[1:3,]"
Logical tests may be used to extract rows that meet specified conditions. Here we utilize the.filter by providing a logical test (as a character variable) to extract all of the rows in which the Sepal.Length is less than 4.4:
dt.choose.rows(dt.name = "dat", the.filter = "Sepal.Length < 4.4", return.as = "all")
#> $result
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1: 4.3 3 1.1 0.1 setosa
#>
#> $code
#> [1] "dat[Sepal.Length < 4.4,]"
Values of the.filter may also be specified as expressions:
dt.choose.rows(dt.name = "dat", the.filter = expression(Sepal.Length < 4.4), return.as = "all")
#> $result
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1: 4.3 3 1.1 0.1 setosa
#>
#> $code
#> [1] "dat[Sepal.Length < 4.4,]"
More complex filtering may also be used:
dt.choose.rows(dt.name = "dat", the.filter = "Sepal.Width >= 3 & Sepal.Length < 4.8 & Species == 'setosa'",
return.as = "all")
#> $result
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1: 4.6 3.1 1.5 0.2 setosa
#> 2: 4.7 3.2 1.3 0.2 setosa
#> 3: 4.3 3.0 1.1 0.1 setosa
#> 4: 4.4 3.2 1.3 0.2 setosa
#> 5: 4.6 3.6 1.0 0.2 setosa
#> 6: 4.4 3.0 1.3 0.2 setosa
#> 7: 4.7 3.2 1.6 0.2 setosa
#> 8: 4.6 3.2 1.4 0.2 setosa
#> 9: 4.6 3.4 1.4 0.3 setosa
#>
#> $code
#> [1] "dat[Sepal.Width >= 3 & Sepal.Length < 4.8 & Species == 'setosa',]"
Notice that the value 'setosa' was specified within the filter using single quotation marks, while the value of the.filter was wrapped in double quotation marks. This can be switched, but they should not be mixed together. Any confusion around this can be resolved by specifying the.filter as an expression.
This function selects a subset of columns from a data.frame or data.table. It is built as a wrapper function of data.table's selection step (using .SD in the j step while specifying the .SDcols argument). Grouped operations may be used to make these selections of columns and rows in each category. We will illustrate its usage in the examples below.
Similar to dt.choose.rows, one can specify the parameters dt.name and the.filter to extract rows from the data. Then the relevant variables (columns) may be selected using the parameter the.variables. Setting this value to "." (the default) will choose all of the rows.
We will begin by selecting all of the variables from the first 5 rows:
dt.choose.cols(dt.name = "dat", the.variables = ".", the.filter = "1:5", return.as = "all")
#> $result
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1: 5.1 3.5 1.4 0.2 setosa
#> 2: 5.2 2.7 3.9 1.4 versicolor
#> 3: 6.4 2.8 5.6 2.2 virginica
#> 4: 4.8 3.0 1.4 0.3 setosa
#> 5: 7.2 3.2 6.0 1.8 virginica
#>
#> $code
#> [1] "dat[1:5,]"
Setting the.variables to a character vector of variable names can be used to extract a subset of the columns:
dt.choose.cols(dt.name = "dat", the.variables = c("Species", "Sepal.Length"), the.filter = "1:3",
return.as = "all")
#> $result
#> Species Sepal.Length
#> 1: setosa 5.1
#> 2: versicolor 5.2
#> 3: virginica 6.4
#>
#> $code
#> [1] "dat[1:3, .SD, .SDcols = c('Species', 'Sepal.Length')]"
Notice that the translation to data.table includes a call to .SD and .SDcols. The structure of dt.choose.cols simplifies this step. Furthermore, its coding translation can help those learning data.table to work more closely with these operators.
Grouping operations help in building more complex queries. When the grouping.variables parameter is specified, then the data will be partitioned into subgroups. Then, within each subgroup, the first.k parameter can be used to extract the first (up to) k rows of data:
dt.choose.cols(dt.name = "dat", the.variables = c("Sepal.Length", "Sepal.Width"),
grouping.variables = "Species", first.k = 2, return.as = "all")
#> $result
#> Species Sepal.Length Sepal.Width
#> 1: setosa 5.1 3.5
#> 2: setosa 4.8 3.0
#> 3: versicolor 5.2 2.7
#> 4: versicolor 6.3 3.3
#> 5: virginica 6.4 2.8
#> 6: virginica 7.2 3.2
#>
#> $code
#> [1] "dat[, .SD[1:min(.N, 2)], .SDcols = c('Sepal.Length', 'Sepal.Width'), keyby = Species]"
This can equivalently be specified using the dt.first.k.rows function:
dt.first.k.rows(dt.name = "dat", k = 2, the.variables = c("Sepal.Length", "Sepal.Width"),
grouping.variables = "Species", return.as = "all")
#> $result
#> Species Sepal.Length Sepal.Width
#> 1: setosa 5.1 3.5
#> 2: setosa 4.8 3.0
#> 3: versicolor 5.2 2.7
#> 4: versicolor 6.3 3.3
#> 5: virginica 6.4 2.8
#> 6: virginica 7.2 3.2
#>
#> $code
#> [1] "dat[, .SD[1:min(.N, 2)], .SDcols = c('Sepal.Length', 'Sepal.Width'), keyby = Species]"
Within dt.choose.cols, we can also use last.k to select (up to) the last k rows for each subgroup:
dt.choose.cols(dt.name = "dat", the.variables = c("Sepal.Length", "Sepal.Width"),
grouping.variables = "Species", last.k = 2, return.as = "all")
#> $result
#> Species Sepal.Length Sepal.Width
#> 1: setosa 5.7 4.4
#> 2: setosa 4.4 2.9
#> 3: versicolor 4.9 2.4
#> 4: versicolor 6.1 2.9
#> 5: virginica 7.1 3.0
#> 6: virginica 6.0 2.2
#>
#> $code
#> [1] "dat[, .SD[max(1, .N - 1):.N], .SDcols = c('Sepal.Length', 'Sepal.Width'), keyby = Species]"
This can equivalently be calculated with the function dt.last.k.rows:
dt.last.k.rows(dt.name = "dat", k = 2, the.variables = c("Sepal.Length", "Sepal.Width"),
grouping.variables = "Species", return.as = "all")
#> $result
#> Species Sepal.Length Sepal.Width
#> 1: setosa 5.7 4.4
#> 2: setosa 4.4 2.9
#> 3: versicolor 4.9 2.4
#> 4: versicolor 6.1 2.9
#> 5: virginica 7.1 3.0
#> 6: virginica 6.0 2.2
#>
#> $code
#> [1] "dat[, .SD[max(1, .N - 1):.N], .SDcols = c('Sepal.Length', 'Sepal.Width'), keyby = Species]"
More generally, a vector of rows to select for each group may also be specified using the row.indices parameter of dt.choose.cols:
dt.choose.cols(dt.name = "dat", the.variables = c("Sepal.Length", "Sepal.Width"),
grouping.variables = "Species", row.indices = c(3, 5, 7), return.as = "all")
#> $result
#> Species Sepal.Length Sepal.Width
#> 1: setosa 4.6 3.1
#> 2: setosa 5.7 3.8
#> 3: setosa 5.0 3.6
#> 4: versicolor 6.4 3.2
#> 5: versicolor 6.6 3.0
#> 6: versicolor 6.2 2.9
#> 7: virginica 6.3 2.9
#> 8: virginica 7.4 2.8
#> 9: virginica 5.8 2.7
#>
#> $code
#> [1] "dat[, .SD[(c(3, 5, 7))[(c(3, 5, 7)) %in% 1:min(.N, dat[, .N])]], .SDcols = c('Sepal.Length', 'Sepal.Width'), keyby = Species]"
The dt.choose.cols method allows the user to 1) reduce the data to a subset, 2) group the data by one or more variables, 3) select columns of data, and 4) provide specialized subsets such as the first k, last k, or selected rows within each group. These operations involve a number of steps that might otherwise be challenging for new analysts to construct. The data.table code is somewhat more compact but requires knowledge of its specialized operators .N, .SD, and .SDcols. Meanwhile, the translated code also allows users to gain familiarity with data.table's methods.
This function counts the number of qualifying rows in a data.table or data.frame object. It is built as a wrapper function of data.table's counting method (the .N operator). These counts may be comprehensive for the entire table or conducted in groups. The full data can also be filtered for qualifying cases prior to counting the rows.
This method will provide a simple count, which is similar to the nrow function:
dt.count.rows(dt.name = "dat", return.as = "all")
#> $result
#> [1] 150
#>
#> $code
#> [1] "dat[, .N]"
Counts may also be specified in a subset of extracted rows using the.filter:
dt.count.rows(dt.name = "dat", the.filter = "Species == 'setosa'", return.as = "all")
#> $result
#> [1] 50
#>
#> $code
#> [1] "dat[Species == 'setosa', .N]"
Counting can also be performed in subgroups of the data:
dt.count.rows(dt.name = "dat", the.filter = "Petal.Length > 1.7", grouping.variables = "Species",
return.as = "all")
#> $result
#> Species N
#> 1: setosa 2
#> 2: versicolor 50
#> 3: virginica 50
#>
#> $code
#> [1] "dat[Petal.Length > 1.7, .N, keyby = Species]"
The name of the counting variable (N by default) may also be specified:
dt.count.rows(dt.name = "dat", the.filter = "Petal.Length > 1.7", grouping.variables = "Species",
count.name = "Total Qualifying Rows", return.as = "all")
#> $result
#> Species Total Qualifying Rows
#> 1: setosa 2
#> 2: versicolor 50
#> 3: virginica 50
#>
#> $code
#> [1] "dat[Petal.Length > 1.7, .(`Total Qualifying Rows` = .N), keyby = Species]"
This function allows a user to add a new variable to an existing data.frame or data.table. It can also be used to update the values of previously defined variables. It is built as a wrapper function of data.table's method of defining new variables by reference using the := operator. The new values can be stated either through a statement of the calculation or by directly providing a vector of values. These updates can also be performed on a subset of the data by incorporating a filter.
We will now create a new variable called Category with the constant value "Flower: Iris" in all rows of the data.
dt.define.variable(dt.name = "dat", variable.name = "Category", the.values = "Flower: Iris",
return.as = "all", specification = "by.value")
#> $result
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Category
#> 1: 5.1 3.5 1.4 0.2 setosa Flower: Iris
#> 2: 5.2 2.7 3.9 1.4 versicolor Flower: Iris
#> 3: 6.4 2.8 5.6 2.2 virginica Flower: Iris
#> 4: 4.8 3.0 1.4 0.3 setosa Flower: Iris
#> 5: 7.2 3.2 6.0 1.8 virginica Flower: Iris
#> ---
#> 146: 4.9 2.4 3.3 1.0 versicolor Flower: Iris
#> 147: 6.7 3.0 5.2 2.3 virginica Flower: Iris
#> 148: 6.1 2.9 4.7 1.4 versicolor Flower: Iris
#> 149: 7.1 3.0 5.9 2.1 virginica Flower: Iris
#> 150: 6.0 2.2 5.0 1.5 virginica Flower: Iris
#>
#> $code
#> [1] "dat[, Category := 'Flower: Iris']"
Setting specification = 'by.value' allows us to assign specific values to the variable rather than providing calculations.
Now we will create a new variable called Max_Sepal_Length_Species. This is meant to convey the maximum value of Sepal Length for each species. We can calculate this by applying the max() function to the Sepal.Length variable while grouping by species. Here we define this new variable using specification = "by.expression" because it calls for a calculation rather than directly supplying the values.
dt.define.variable(dt.name = "dat", variable.name = "Max_Sepal_Length_Species", the.values = "max(Sepal.Length)",
specification = "by.expression", grouping.variables = "Species", sortby.group = TRUE,
return.as = "all")
#> $result
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Category
#> 1: 5.1 3.5 1.4 0.2 setosa Flower: Iris
#> 2: 4.8 3.0 1.4 0.3 setosa Flower: Iris
#> 3: 4.6 3.1 1.5 0.2 setosa Flower: Iris
#> 4: 5.1 3.8 1.6 0.2 setosa Flower: Iris
#> 5: 5.7 3.8 1.7 0.3 setosa Flower: Iris
#> ---
#> 146: 6.7 3.3 5.7 2.1 virginica Flower: Iris
#> 147: 5.8 2.7 5.1 1.9 virginica Flower: Iris
#> 148: 6.7 3.0 5.2 2.3 virginica Flower: Iris
#> 149: 7.1 3.0 5.9 2.1 virginica Flower: Iris
#> 150: 6.0 2.2 5.0 1.5 virginica Flower: Iris
#> Max_Sepal_Length_Species
#> 1: 5.8
#> 2: 5.8
#> 3: 5.8
#> 4: 5.8
#> 5: 5.8
#> ---
#> 146: 7.9
#> 147: 7.9
#> 148: 7.9
#> 149: 7.9
#> 150: 7.9
#>
#> $code
#> [1] "dat[, Max_Sepal_Length_Species := max(Sepal.Length), keyby = Species]"
It is also possible to define new variables in specified rows by applying an appropriate filter. Here we define a new variable that only applies to the setosa flowers:
dt.define.variable(dt.name = "dat", variable.name = "setosa_sl_below_5", the.values = "Sepal.Length < 5",
specification = "by.expression", the.filter = "Species == 'setosa'", return.as = "all")
#> $result
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Category
#> 1: 5.1 3.5 1.4 0.2 setosa Flower: Iris
#> 2: 4.8 3.0 1.4 0.3 setosa Flower: Iris
#> 3: 4.6 3.1 1.5 0.2 setosa Flower: Iris
#> 4: 5.1 3.8 1.6 0.2 setosa Flower: Iris
#> 5: 5.7 3.8 1.7 0.3 setosa Flower: Iris
#> ---
#> 146: 6.7 3.3 5.7 2.1 virginica Flower: Iris
#> 147: 5.8 2.7 5.1 1.9 virginica Flower: Iris
#> 148: 6.7 3.0 5.2 2.3 virginica Flower: Iris
#> 149: 7.1 3.0 5.9 2.1 virginica Flower: Iris
#> 150: 6.0 2.2 5.0 1.5 virginica Flower: Iris
#> Max_Sepal_Length_Species setosa_sl_below_5
#> 1: 5.8 FALSE
#> 2: 5.8 TRUE
#> 3: 5.8 TRUE
#> 4: 5.8 FALSE
#> 5: 5.8 FALSE
#> ---
#> 146: 7.9 NA
#> 147: 7.9 NA
#> 148: 7.9 NA
#> 149: 7.9 NA
#> 150: 7.9 NA
#>
#> $code
#> [1] "dat[Species == 'setosa', setosa_sl_below_5 := Sepal.Length < 5]"
The new values are only supplied in the rows corresponding to setosa flowers. For the other rows of the data, the new variable is set with missing values.
This function is designed to remove columns of data. Note that once a variable is removed, its values cannot be retrieved.
dt.remove.variables(dt.name = "dat", the.variables = c("Category", "setosa_sl_below_5"),
return.as = "all")
#> $result
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1: 5.1 3.5 1.4 0.2 setosa
#> 2: 4.8 3.0 1.4 0.3 setosa
#> 3: 4.6 3.1 1.5 0.2 setosa
#> 4: 5.1 3.8 1.6 0.2 setosa
#> 5: 5.7 3.8 1.7 0.3 setosa
#> ---
#> 146: 6.7 3.3 5.7 2.1 virginica
#> 147: 5.8 2.7 5.1 1.9 virginica
#> 148: 6.7 3.0 5.2 2.3 virginica
#> 149: 7.1 3.0 5.9 2.1 virginica
#> 150: 6.0 2.2 5.0 1.5 virginica
#> Max_Sepal_Length_Species
#> 1: 5.8
#> 2: 5.8
#> 3: 5.8
#> 4: 5.8
#> 5: 5.8
#> ---
#> 146: 7.9
#> 147: 7.9
#> 148: 7.9
#> 149: 7.9
#> 150: 7.9
#>
#> $code
#> [1] "dat[, c('Category', 'setosa_sl_below_5') := NULL]"
Reviewing the translated code, this step is achieved by setting the variables to remove equal to a value of NULL. Note that this operation cannot be performed in subsets of the rows of the data; it removes all measurements of the selected variables. However, dt.define.variable can be used to set specific values to missing (NA) in subsets if that is sufficient.
This function sorts the rows of a data.frame or data.table based on selected columns. It is built as a light wrapper function of data.table's setorderv() function.
dt.sort(dt.name = "dat", sorting.variables = c("Species", "Sepal.Length"), sort.increasing = TRUE,
return.as = "all")
#> $result
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1: 4.3 3.0 1.1 0.1 setosa
#> 2: 4.4 3.2 1.3 0.2 setosa
#> 3: 4.4 3.0 1.3 0.2 setosa
#> 4: 4.4 2.9 1.4 0.2 setosa
#> 5: 4.5 2.3 1.3 0.3 setosa
#> ---
#> 146: 7.7 3.8 6.7 2.2 virginica
#> 147: 7.7 2.6 6.9 2.3 virginica
#> 148: 7.7 2.8 6.7 2.0 virginica
#> 149: 7.7 3.0 6.1 2.3 virginica
#> 150: 7.9 3.8 6.4 2.0 virginica
#> Max_Sepal_Length_Species
#> 1: 5.8
#> 2: 5.8
#> 3: 5.8
#> 4: 5.8
#> 5: 5.8
#> ---
#> 146: 7.9
#> 147: 7.9
#> 148: 7.9
#> 149: 7.9
#> 150: 7.9
#>
#> $code
#> [1] "data.table::setorderv(x = dat, cols = c('Species', 'Sepal.Length'), order = 1, na.last = FALSE)"
The sorting proceeds in the order of the specified sorting.variables. In the above example, the data are sorted by Species first. Then, within groups of Species, the data are additionally sorted by increasing Sepal.Length.
The data can be sorted in decreasing order by specifying sort.increasing = FALSE:
dt.sort(dt.name = "dat", sorting.variables = c("Species", "Sepal.Length"), sort.increasing = FALSE,
return.as = "all")
#> $result
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1: 7.9 3.8 6.4 2.0 virginica
#> 2: 7.7 3.8 6.7 2.2 virginica
#> 3: 7.7 2.6 6.9 2.3 virginica
#> 4: 7.7 2.8 6.7 2.0 virginica
#> 5: 7.7 3.0 6.1 2.3 virginica
#> ---
#> 146: 4.5 2.3 1.3 0.3 setosa
#> 147: 4.4 3.2 1.3 0.2 setosa
#> 148: 4.4 3.0 1.3 0.2 setosa
#> 149: 4.4 2.9 1.4 0.2 setosa
#> 150: 4.3 3.0 1.1 0.1 setosa
#> Max_Sepal_Length_Species
#> 1: 7.9
#> 2: 7.9
#> 3: 7.9
#> 4: 7.9
#> 5: 7.9
#> ---
#> 146: 5.8
#> 147: 5.8
#> 148: 5.8
#> 149: 5.8
#> 150: 5.8
#>
#> $code
#> [1] "data.table::setorderv(x = dat, cols = c('Species', 'Sepal.Length'), order = -1, na.last = FALSE)"
Each sorting variable's ordering can be specified as either increasing (TRUE) or decreasing (FALSE):
dt.sort(dt.name = "dat", sorting.variables = c("Species", "Sepal.Length"), sort.increasing = c(T,
F), return.as = "all")
#> $result
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1: 5.8 4.0 1.2 0.2 setosa
#> 2: 5.7 3.8 1.7 0.3 setosa
#> 3: 5.7 4.4 1.5 0.4 setosa
#> 4: 5.5 3.5 1.3 0.2 setosa
#> 5: 5.5 4.2 1.4 0.2 setosa
#> ---
#> 146: 5.8 2.8 5.1 2.4 virginica
#> 147: 5.8 2.7 5.1 1.9 virginica
#> 148: 5.7 2.5 5.0 2.0 virginica
#> 149: 5.6 2.8 4.9 2.0 virginica
#> 150: 4.9 2.5 4.5 1.7 virginica
#> Max_Sepal_Length_Species
#> 1: 5.8
#> 2: 5.8
#> 3: 5.8
#> 4: 5.8
#> 5: 5.8
#> ---
#> 146: 7.9
#> 147: 7.9
#> 148: 7.9
#> 149: 7.9
#> 150: 7.9
#>
#> $code
#> [1] "data.table::setorderv(x = dat, cols = c('Species', 'Sepal.Length'), order = c(1, -1), na.last = FALSE)"
The prior operations of selecting rows or columns, filtering the data, counting the number of rows, defining or removing variables, and sorting the data provide helpful tools. However, most analyses of data rely upon some form of calculation. This may apply to a single variable or be repeated across many variables. Considerations of filtering and grouping variables may also apply. The dt.calculate function is designed to facilitate calculations using additional methods in R to one or more variables of the data.
Here we demonstrate a calculation of the mean of the Sepal.Length:
dt.calculate(dt.name = "dat", the.functions = "mean", the.variables = "Sepal.Length",
return.as = "all")
#> $result
#> variable mean
#> 1: Sepal.Length 5.843333
#>
#> $code
#> [1] "dat[, .(variable = 'Sepal.Length', mean = mean(Sepal.Length))]"
This can be extended to multiple functions:
dt.calculate(dt.name = "dat", the.functions = c("mean", "median", "sd"), the.variables = "Sepal.Length",
return.as = "all")
#> $result
#> variable mean median sd
#> 1: Sepal.Length 5.843333 5.8 0.8280661
#>
#> $code
#> [1] "dat[, .(variable = 'Sepal.Length', mean = mean(Sepal.Length), median = median(Sepal.Length), sd = sd(Sepal.Length))]"
It is also possible to supply additional arguments to the functions by specifying the other.params variable:
dt.calculate(dt.name = "dat", the.functions = c("mean", "median", "sd"), the.variables = "Sepal.Length",
other.params = "na.rm = T", return.as = "all")
#> $result
#> variable mean median sd
#> 1: Sepal.Length 5.843333 5.8 0.8280661
#>
#> $code
#> [1] "dat[, .(variable = 'Sepal.Length', mean = mean(Sepal.Length, na.rm = T), median = median(Sepal.Length, na.rm = T), sd = sd(Sepal.Length, na.rm = T))]"
Note that when multiple functions are specified, the supplied values of other.params must be compatible with all of these functions. So, for instance, setting na.rm = T can apply to the mean, median, and standard deviation functions, but it would not necessarily work with other functions -- e.g. nrow() -- that might be considered.
Calculations can be extended to allow for filtering and grouping:
dt.calculate(dt.name = "dat", the.functions = c("mean", "median", "sd"), the.variables = "Sepal.Length",
the.filter = "Sepal.Length > 3.5", grouping.variables = "Species", return.as = "all")
#> $result
#> Species variable mean median sd
#> 1: setosa Sepal.Length 5.006 5.0 0.3524897
#> 2: versicolor Sepal.Length 5.936 5.9 0.5161711
#> 3: virginica Sepal.Length 6.588 6.5 0.6358796
#>
#> $code
#> [1] "dat[Sepal.Length > 3.5, .(variable = 'Sepal.Length', mean = mean(Sepal.Length), median = median(Sepal.Length), sd = sd(Sepal.Length)), keyby = Species]"
This can also be scaled to include more than one variable:
dt.calculate(dt.name = "dat", the.functions = c("mean", "median", "sd"), the.variables = c("Sepal.Length",
"Sepal.Width", "Petal.Length", "Petal.Width"), the.filter = "Sepal.Length > 3.5",
grouping.variables = "Species", return.as = "all")
#> $result
#> Species variable mean median sd
#> 1: setosa Sepal.Length 5.006 5 0.3524897
#> 2: setosa Sepal.Width 3.428 3.4 0.3790644
#> 3: setosa Petal.Length 1.462 1.5 0.173664
#> 4: setosa Petal.Width 0.246 0.2 0.1053856
#> 5: versicolor Sepal.Length 5.936 5.9 0.5161711
#> 6: versicolor Sepal.Width 2.77 2.8 0.3137983
#> 7: versicolor Petal.Length 4.26 4.35 0.469911
#> 8: versicolor Petal.Width 1.326 1.3 0.1977527
#> 9: virginica Sepal.Length 6.588 6.5 0.6358796
#> 10: virginica Sepal.Width 2.974 3 0.3224966
#> 11: virginica Petal.Length 5.552 5.55 0.5518947
#> 12: virginica Petal.Width 2.026 2 0.2746501
#>
#> $code
#> [1] "dat[Sepal.Length > 3.5, .(variable = c('Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width'), mean = lapply(X = .SD, FUN = 'mean'), median = lapply(X = .SD, FUN = 'median'), sd = lapply(X = .SD, FUN = 'sd')), .SDcols = c('Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width'), keyby = Species]"
The table.format variable can be used to shift the results to wide format:
# dt.calculate(dt.name = 'dat', the.functions = c('mean', 'median', 'sd'),
# the.variables = c('Sepal.Length', 'Sepal.Width', 'Petal.Length',
# 'Petal.Width'), the.filter = 'Sepal.Length > 3.5', grouping.variables =
# 'Species', table.format = 'wide', return.as = 'all')
When wide formats are used for calculations on multiple variables and functions, then the results create columns with names of the form variable_function. Additionally, to create tables in wide format, the translated data.table statement does not use multiple lapply statements but rather enumerates each function as applied to each variable.
Whether the function to apply exists as a base method, from a loaded package, or is a customized function defined by the user, dt.calculate is able to make use of the method. Here we create a simple function called triple mean
that calculates 3 times the mean value:
`triple mean` <- function(x, na.rm = T) {
return(3 * mean(x = x, na.rm = na.rm))
}
dt.calculate(dt.name = "dat", the.variables = c("Sepal.Length", "Sepal.Width"), the.functions = c("mean",
"sd", "triple mean"), grouping.variables = "Species", table.format = "long",
return.as = "all")
#> $result
#> Species variable mean sd triple mean
#> 1: setosa Sepal.Length 5.006 0.3524897 15.018
#> 2: setosa Sepal.Width 3.428 0.3790644 10.284
#> 3: versicolor Sepal.Length 5.936 0.5161711 17.808
#> 4: versicolor Sepal.Width 2.77 0.3137983 8.31
#> 5: virginica Sepal.Length 6.588 0.6358796 19.764
#> 6: virginica Sepal.Width 2.974 0.3224966 8.922
#>
#> $code
#> [1] "dat[, .(variable = c('Sepal.Length', 'Sepal.Width'), mean = lapply(X = .SD, FUN = 'mean'), sd = lapply(X = .SD, FUN = 'sd'), `triple mean` = lapply(X = .SD, FUN = 'triple mean')), .SDcols = c('Sepal.Length', 'Sepal.Width'), keyby = Species]"
The examples of dt.calculate provided so far are meant to apply to individual variables. Calculating the mean or the standard deviation is separately applied to each variable. However, some functions -- e.g. rowSums or linear regression -- are meant to apply to multiple variables with a single calculation. These can be incorporated into dt.calculate by specifying individual.variables = FALSE. Here we will review a few examples of these methods.
First, we will calculate the row sums of the numeric variables for the first 5 rows:
dt.calculate(dt.name = "dat", the.functions = "rowSums", the.variables = c("Sepal.Length",
"Sepal.Width", "Petal.Length", "Petal.Width"), the.filter = 1:5, individual.variables = FALSE,
return.as = "all")
#> $result
#> rowSums
#> 1: 11.2
#> 2: 11.5
#> 3: 12.0
#> 4: 10.5
#> 5: 11.3
#>
#> $code
#> [1] "dat[1:5, .(rowSums = rowSums(.SD)), .SDcols = c('Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width')]"
Here we will construct an example of extracting the table of estimated coefficients from a linear regression that is separately performed in each species. To do so, we will first define a customized extraction function:
get.lm.coefs <- function(data, formula) {
require(data.table)
mod <- lm(formula = formula, data = data)
the.coefs <- as.data.table(x = summary(mod)$coefficients, keep.rownames = TRUE)
setnames(x = the.coefs, old = "rn", new = "Variable")
return(the.coefs)
}
Then we will apply this function to a linear regression of Sepal.Length in terms of the other numeric variables, calculated separately for each species:
## linear regression
dt.calculate(dt.name = "dat", the.functions = "get.lm.coefs", grouping.variables = "Species",
other.params = "formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width",
return.as = "all", individual.variables = F, add.function.name = F)
#> Loading required package: data.table
#> Warning: package 'data.table' was built under R version 3.6.2
#> $result
#> Species Variable Estimate Std. Error t value Pr(>|t|)
#> 1: setosa (Intercept) 2.3518898 0.39286751 5.9864707 3.034183e-07
#> 2: setosa Sepal.Width 0.6548350 0.09244742 7.0833236 6.834434e-09
#> 3: setosa Petal.Length 0.2375602 0.20801921 1.1420107 2.593594e-01
#> 4: setosa Petal.Width 0.2521257 0.34686362 0.7268727 4.709870e-01
#> 5: versicolor (Intercept) 1.8955395 0.50705524 3.7383295 5.112246e-04
#> 6: versicolor Sepal.Width 0.3868576 0.20454490 1.8913091 6.488965e-02
#> 7: versicolor Petal.Length 0.9083370 0.16543248 5.4906811 1.666695e-06
#> 8: versicolor Petal.Width -0.6792238 0.43538206 -1.5600639 1.255990e-01
#> 9: virginica (Intercept) 0.6998830 0.53360089 1.3116227 1.961563e-01
#> 10: virginica Sepal.Width 0.3303370 0.17432873 1.8949086 6.439972e-02
#> 11: virginica Petal.Length 0.9455356 0.09072204 10.4223360 1.074269e-13
#> 12: virginica Petal.Width -0.1697527 0.19807243 -0.8570233 3.958750e-01
#>
#> $code
#> [1] "dat[, get.lm.coefs(.SD, formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width), keyby = Species]"
DTwrappers presents a wide range of data processing applications. They are all built with the goal of reducing the requirements to learn computer programming. These methods are not without some degree of syntax. However, these methods are intended to reduce the learning curve needed to perform data analysis. DTwrappers can allow a user to focus on the core analytical problems without the requirement of specialized coding operators or the necessary ordering of multiple functions. For some beginning analysts, the DTwrappers package may help them acclimate to working with data and utilizing more advanced methods. Programmers with experience in other languages can also more easily transfer that knowledge by using DTwrappers.
The running time complexity of DTwrappers is a function of its translations along with processing the data using data.table's methods. The translation step includes some degree of additional overhead. However, this portion is not tied to the overall sample size of the data. In practice, the running time performance of DTwrappers should not be substantially different from that of data.table, especially at larger sample sizes.
Additionally, it should be noted that the DTwrappers is intended to serve as a supplement to existing data processing packages rather than as a replacement. Its methods may be beneficial for a subset of R's users. Determining the relative utility of the package is an area of further exploration.