Descriptive statistics in ‘Tplyr’ are created using group_desc()
function when creating a layer. While group_desc()
allows you to set your target, by variables, and filter criteria, a great deal of the control of the layer comes from set_format_strings()
where the actual summaries are declared.
tplyr_table(adsl, TRT01P) %>%
add_layer(
group_desc(AGE, by = "Age (years)", where= SAFFL=="Y") %>%
set_format_strings(
"n" = f_str("xx", n),
"Mean (SD)"= f_str("xx.x (xx.xx)", mean, sd),
"Median" = f_str("xx.x", median),
"Q1, Q3" = f_str("xx, xx", q1, q3),
"Min, Max" = f_str("xx, xx", min, max),
"Missing" = f_str("xx", missing)
)%>%
) build() %>%
kable()
row_label1 | row_label2 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | ord_layer_index | ord_layer_1 | ord_layer_2 |
---|---|---|---|---|---|---|---|
Age (years) | n | 86 | 84 | 84 | 1 | 1 | 1 |
Age (years) | Mean (SD) | 75.2 ( 8.59) | 74.4 ( 7.89) | 75.7 ( 8.29) | 1 | 1 | 2 |
Age (years) | Median | 76.0 | 76.0 | 77.5 | 1 | 1 | 3 |
Age (years) | Q1, Q3 | 69, 82 | 71, 80 | 71, 82 | 1 | 1 | 4 |
Age (years) | Min, Max | 52, 89 | 56, 88 | 51, 88 | 1 | 1 | 5 |
Age (years) | Missing | 0 | 0 | 0 | 1 | 1 | 6 |
Let’s walk through this call to set_format_strings
to understand in detail what’s going on:
set_format_strings()
become the row label in the output. This allows you to define some custom text in set_format_strings()
to explain the summary that is presented on the associated row. This text is fully in your control.f_str()
. As explained in the vignette("Tplyr")
, this is an object that captures a lot of metadata to understand how the strings should be presented.f_str()
call, you see x’s in quotes. This defines how you’d like the numbers formatted from the resulting summaries. The number of x’s you use on the left side of a decimal control the space allotted for an integer, and the right side controls the decimal precision. Decimals are rounded prior to string formatting - so no need to worry about that. Note that this forcefully sets the decimal and integer precision - ‘Tplyr’ can automatically determine this for you as well, but more on that later.f_str()
calls have two summaries specified. This allows you to put two summaries in the same string and present them on the same line.But where do these summary names come from? And which ones does ‘Tplyr’ have?
We’ve built a number of default summaries into ‘Tplyr’, which allows you to perform these summaries without having to specify the functions to calculate them yourself. The summaries built in to ‘Tplyr’ are listed below. In the second column are the names that you would use within an f_str()
call to use them. In the third column, we have the syntax used to make the function call.
Statistic | Variable Names | Function Call |
---|---|---|
N | n | n() |
Mean | mean | mean(.var, na.rm=TRUE) |
Standard Deviation | sd | sd(.var, na.rm=TRUE) |
Median | median | median(.var, na.rm=TRUE) |
Variance | variance | var(.var, na.rm=TRUE) |
Minimum | min | min(.var, na.rm=TRUE) |
Maximum | max | max(.var, na.rm=TRUE) |
Interquartile Range | iqr | IQR(.var, na.rm=TRUE, type=getOption(‘tplyr.quantile_type’) |
Q1 | q1 | quantile(.var, na.rm=TRUE, type=getOption(‘tplyr.quantile_type’))[[2]] |
Q3 | q3 | quantile(.var, na.rm=TRUE, type=getOption(‘tplyr.quantile_type’))[[4]] |
Missing | missing | sum(is.na(.var)) |
Note that the only non-default option being used in any of the function calls above is na.rm=TRUE
. For most of the functions, this is likely fine - but with IQR, Q1, and Q3 note that there are several different quantile algorithms available in R. The default we chose to use is the R default of Type 7:
\[
m = 1-p. p[k] = (k - 1) / (n - 1). \textrm{In this case, } p[k] = mode[F(x[k])]. \textrm{This is used by S.}
\] That said, we still want to offer some flexibility here, so you can change the quantile algorithm by switching the tplyr.quantile_type
option. If you’re intending to match the SAS definition, you can use Type 3. For more information, see the stats::quantile()
documentation.
The example below demonstrates using the default quantile algorithm in R.
tplyr_table(adsl, TRT01P) %>%
add_layer(
group_desc(CUMDOSE) %>%
set_format_strings("Q1, Q3" = f_str('xxxxx, xxxxx', q1, q3))
%>%
) build() %>%
select(-starts_with("ord")) %>%
kable()
row_label1 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose |
---|---|---|---|
Q1, Q3 | 0, 0 | 2646, 13959 | 1984, 9801 |
This next example demonstrates using quantile algorithm Type 3, which matches the SAS definition of:
\[ \textrm{Nearest even order statistic. γ = 0 if g = 0 and j is even, and 1 otherwise.} \]
options(tplyr.quantile_type = 3)
tplyr_table(adsl, TRT01P) %>%
add_layer(
group_desc(CUMDOSE) %>%
set_format_strings("Q1, Q3" = f_str('xxxxx, xxxxx', q1, q3))
%>%
) build() %>%
select(-starts_with("ord")) %>%
kable()
row_label1 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose |
---|---|---|---|
Q1, Q3 | 0, 0 | 2565, 13959 | 1944, 9774 |
It’s up to you to determine which algorithm you should use - but we found it necessary to provide you with the flexibility to change this within the default summaries.
But what if ‘Tplyr’ doesn’t offer you the summaries that you need?
We understand that our defaults may not cover every descriptive statistic that you’d like to see. That’s why we’ve opened to door to creating custom summaries. Custom summaries allow you to provide any function you’d like into a desc layer. You can focus on the derivation and how to calculate the number you want to see. ‘Tplyr’ can consume this function, and use all the existing tools within ‘Tplyr’ to produce the string formatted result alongside any of the default summaries we provide as well.
Custom summaries may be provided in two ways:
tplyr.custom_summaries
option set at your session levelset_custom_summaries()
at the layer levelAs with any other setting in ‘Tplyr’, the layer setting will always take precedence over any other setting.
Let’s look at an example.
tplyr_table(adsl, TRT01P) %>%
add_layer(
group_desc(vars(AGE, HEIGHTBL), by = "Sepal Length") %>%
set_custom_summaries(
geometric_mean = exp(sum(log(.var[.var > 0]), na.rm=TRUE) / length(.var))
%>%
) set_format_strings(
'Geometric Mean (SD)' = f_str('xx.xx (xx.xxx)', geometric_mean, sd)
)%>%
) build() %>%
select(-starts_with("ord")) %>%
kable()
row_label1 | row_label2 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | var2_Placebo | var2_Xanomeline High Dose | var2_Xanomeline Low Dose |
---|---|---|---|---|---|---|---|
Sepal Length | Geometric Mean (SD) | 74.70 ( 8.590) | 73.94 ( 7.886) | 75.18 ( 8.286) | 162.17 (11.522) | 165.51 (10.131) | 163.11 (10.419) |
Here, a few important things are demonstrated:
AGE
and HEIGHTBL
are being summarized in the same layer. AGE
results go to the var1_
variables and HEIGHTBL
results go to the var2_
variables.set_custom_summaries()
, or names on the left side of the equals, flow into set_format_strings()
in the f_str()
calls. Just like the default summaries, geometric_mean
becomes the name that you refer to in order to use the geometric mean derivation in a summary..var
. This may not seem intuitive. The reason we have to use .var
is so that, like in this example, the custom function can be applied to each of the separate target variables.Another note about custom summaries is that you’re able to overwrite the default summaries built into ‘Tplyr’ as well. Don’t like the default summary functions that we provide? Use the tplyr.custom_summaries
option to overwrite them in your session, and add any new ones that you would like to include.
For example, here we use the ‘Tplyr’ default mean.
tplyr_table(adsl, TRT01P) %>%
add_layer(
group_desc(AGE) %>%
set_format_strings("Mean" = f_str('xx.xx', mean))
%>%
) build() %>%
kable()
row_label1 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | ord_layer_index | ord_layer_1 |
---|---|---|---|---|---|
Mean | 75.21 | 74.38 | 75.67 | 1 | 1 |
But now, let’s overwrite mean
using a custom summary. Let’s use a trimmed mean instead, taking 20% of observations off of both ends.
options(tplyr.custom_summaries =
::quos(
rlangmean = mean(.var, na.rm=TRUE, trim=0.4)
)
)
tplyr_table(adsl, TRT01P) %>%
add_layer(
group_desc(AGE) %>%
set_format_strings("Mean" = f_str('xx.xx', mean))
%>%
) build() %>%
kable()
row_label1 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | ord_layer_index | ord_layer_1 |
---|---|---|---|---|---|
Mean | 76.28 | 75.94 | 77.44 | 1 | 1 |
Note that the table code used to produce the output is the same. Now ‘Tplyr’ used the custom summary function for mean
as specified in the tplyr.custom_summaries
option. Also note the use of rlang::quos()
. We’ve done our best to mask this from the user everywhere possible and make the interfaces clean and intuitive, but a great deal of ‘Tplyr’ is built using ‘rlang’ and non-standard evaluation. Within this option is one of the very few instances where a user needs to concern themselves with the use of quosures. If you’d like to learn more about non-standard evaluation and quosures, we recommend Section IV in Advanced R.
A lot of the nuance to formatting descriptive statistics layers has already been covered above, but there are a couple more tricks to getting the most out of ‘Tplyr’. One of these tricks is filling empty values.
By default, if there is no available value for a summary in a particular observation, the result being presented will be blanked out.
<- adlb %>%
adlb_2 filter(TRTA != "Placebo")
tplyr_table(adlb_2, TRTA) %>%
set_pop_data(adsl) %>%
set_pop_treat_var(TRT01P) %>%
add_layer(
group_desc(AVAL, by=PARAMCD) %>%
set_format_strings('Mean (SD)' = f_str('xxx (xxx)', mean, sd))
%>%
) build() %>%
head() %>%
select(-starts_with("ord")) %>%
kable()
row_label1 | row_label2 | var1_Xanomeline High Dose | var1_Xanomeline Low Dose |
---|---|---|---|
BUN | Mean (SD) | 5 ( 1) | 6 ( 3) |
CA | Mean (SD) | 2 ( 0) | 2 ( 0) |
CK | Mean (SD) | 64 ( 94) | 58 ( 78) |
GGT | Mean (SD) | 17 ( 49) | 21 ( 27) |
URATE | Mean (SD) | 271 ( 88) | 231 ( 87) |
Note how the entire example above has all records in var1_Placebo
missing. ‘Tplyr’ gives you control over how you fill this space. Let’s say that we wanted instead to make that space say “Missing”. You can control this with the f_str()
object using the empty
parameter
<- adlb %>%
adlb_2 filter(TRTA != "Placebo")
tplyr_table(adlb_2, TRTA) %>%
set_pop_data(adsl) %>%
set_pop_treat_var(TRT01P) %>%
add_layer(
group_desc(AVAL, by=PARAMCD) %>%
set_format_strings('Mean (SD)' = f_str('xxx.xx (xxx.xxx)', mean, sd, empty=c(.overall="MISSING")))
%>%
) build() %>%
head() %>%
select(-starts_with("ord")) %>%
kable()
row_label1 | row_label2 | var1_Xanomeline High Dose | var1_Xanomeline Low Dose |
---|---|---|---|
BUN | Mean (SD) | 4.57 ( 1.301) | 5.71 ( 2.940) |
CA | Mean (SD) | 2.19 ( 0.137) | 2.15 ( 0.083) |
CK | Mean (SD) | 64.25 ( 93.986) | 58.33 ( 77.915) |
GGT | Mean (SD) | 16.75 ( 48.692) | 21.33 ( 26.989) |
URATE | Mean (SD) | 271.23 ( 88.161) | 230.98 ( 87.006) |
Look at the empty
parameter above. Here, we use a named character vector, where the name is .overall
. When this name is used, if all elements within the cell are missing, they will be filled with the specified text. Otherwise, the provided string will fill just the missing parameter. In some cases, this may not be what you’d like to see. Perhaps we want a string that fills each missing space.
<- adlb %>%
adlb_2 filter(TRTA != "Placebo")
tplyr_table(adlb_2, TRTA) %>%
set_pop_data(adsl) %>%
set_pop_treat_var(TRT01P) %>%
add_layer(
group_desc(AVAL, by=PARAMCD) %>%
set_format_strings('Mean (SD)' = f_str('xxx.xx (xxx.xxx)', mean, sd, empty=c("NA")))
%>%
) build() %>%
head() %>%
select(-starts_with("ord")) %>%
kable()
row_label1 | row_label2 | var1_Xanomeline High Dose | var1_Xanomeline Low Dose |
---|---|---|---|
BUN | Mean (SD) | 4.57 ( 1.301) | 5.71 ( 2.940) |
CA | Mean (SD) | 2.19 ( 0.137) | 2.15 ( 0.083) |
CK | Mean (SD) | 64.25 ( 93.986) | 58.33 ( 77.915) |
GGT | Mean (SD) | 16.75 ( 48.692) | 21.33 ( 26.989) |
URATE | Mean (SD) | 271.23 ( 88.161) | 230.98 ( 87.006) |
In the example above, instead of filling the whole space, the empty
text of “NA” replaces the empty value for each element. So for ‘Mean (SD)’, we now have ‘NA ( NA)’. Note that the proper padding was still used for ‘NA’ to make sure the parentheses still align with populated records.
You may have noticed that the approach to formatting covered so far leaves a lot to be desired. Consider analyzing lab results, where you may want precision to vary based on the collected precision of the tests. Furthermore, depending on the summary being presented, you may wish to increase the precision further. For example, you may want the mean to be at collected precision +1 decimal place, and for standard deviation +2.
‘Tplyr’ has this covered using auto-precision. Auto-precision allows you to format your numeric summaries based on the precision of the data collected. This has all been built into the format strings, because a natural place to specify your desired format is where you specify how you want your data presented. If you wish to use auto-precision, use a
instead of x
when creating your summaries. Note that only one a
is needed on each side of a decimal. To use increased precision, use a+n
where n
is the number of additional spaces you wish to add.
tplyr_table(adlb, TRTA) %>%
add_layer(
group_desc(AVAL, by = PARAMCD) %>%
set_format_strings(
'Mean (SD)' = f_str('a.a+1 (a.a+2)', mean, sd)
)%>%
) build() %>%
head(20) %>%
kable()
row_label1 | row_label2 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | ord_layer_index | ord_layer_1 | ord_layer_2 |
---|---|---|---|---|---|---|---|
BUN | Mean (SD) | 4.7430 ( 2.05463) | 4.5696 ( 1.30148) | 5.7120 ( 2.94018) | 1 | 1 | 1 |
CA | Mean (SD) | 2.165660 (0.0692494) | 2.189362 (0.1372011) | 2.145700 (0.0830867) | 1 | 2 | 1 |
CK | Mean (SD) | 72.4 ( 288.41) | 64.2 ( 93.99) | 58.3 ( 77.91) | 1 | 3 | 1 |
GGT | Mean (SD) | 17.8 ( 34.77) | 16.8 ( 48.69) | 21.3 ( 26.99) | 1 | 4 | 1 |
URATE | Mean (SD) | 235.9373 ( 83.69662) | 271.2288 ( 88.16093) | 230.9807 ( 87.00646) | 1 | 5 | 1 |
As you can see, the decimal precision is now varying depending on the test being performed. Notice that both the integer and the decimal side of each number fluctuate as well. Tpylr
collects both the integer and decimal precision, and you can specify both separately. For example, you could use x
’s to specify a default number of spaces for your integers that are used consistently across by variables, but vary the decimal precision based on collected data. You can also increment the number of spaces for both integer and decimal separately.
But - this is kind of ugly, isn’t it? Do we really need all 6 decimal places collected for CA? For this reason, you’re able to set a cap on the precision that’s displayed:
tplyr_table(adlb, TRTA) %>%
add_layer(
group_desc(AVAL, by = PARAMCD) %>%
set_format_strings(
'Mean (SD)' = f_str('a.a+1 (a.a+2)', mean, sd),
cap = c(int=3, dec=2)
)%>%
) build() %>%
head(20) %>%
kable()
row_label1 | row_label2 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | ord_layer_index | ord_layer_1 | ord_layer_2 |
---|---|---|---|---|---|---|---|
BUN | Mean (SD) | 4.743 ( 2.0546) | 4.570 ( 1.3015) | 5.712 ( 2.9402) | 1 | 1 | 1 |
CA | Mean (SD) | 2.166 (0.0692) | 2.189 (0.1372) | 2.146 (0.0831) | 1 | 2 | 1 |
CK | Mean (SD) | 72.4 (288.41) | 64.2 ( 93.99) | 58.3 ( 77.91) | 1 | 3 | 1 |
GGT | Mean (SD) | 17.8 ( 34.77) | 16.8 ( 48.69) | 21.3 ( 26.99) | 1 | 4 | 1 |
URATE | Mean (SD) | 235.937 ( 83.6966) | 271.229 ( 88.1609) | 230.981 ( 87.0065) | 1 | 5 | 1 |
Now that looks better. The cap
argument is part of set_format_strings()
. You need to specify the integer and decimal caps separately. Note that integer precision works slightly differently than decimal precision. Integer precision relates to the length allotted for the left side of a decimal, but integers will not truncate. When using ‘x’ formatting, if an integer exceeds the set length, it will push the number over. If the integer side of auto-precision is not capped, the necessary length for an integer in the associated by group will be as long as necessary. Decimals, on the other hand, round to the specified length. These caps apply to the length allotted for the “a” on either the integer or the decimal. So for example, if the decimal length is capped at 2 and the selected precision is “a+1”, then 3 decimal places will be allotted.
This was a basic situation, but if you’re paying close attention, you may have some questions. What if you have more by variables, like by visit AND test. Do we then calculate precision by visit and test? What if collected precision is different per visit and we don’t want that? What about multiple summary variables? How do we determine precision then? We have modifier functions for this:
tplyr_table(adlb, TRTA) %>%
add_layer(
group_desc(vars(AVAL, CHG, BASE), by = PARAMCD) %>%
set_format_strings(
'Mean (SD)' = f_str('a.a+1 (a.a+2)', mean, sd, empty="NA"),
cap = c(int=3, dec=2)
%>%
) set_precision_on(AVAL) %>%
set_precision_by(PARAMCD)
%>%
) build() %>%
head() %>%
kable()
row_label1 | row_label2 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | var2_Placebo | var2_Xanomeline High Dose | var2_Xanomeline Low Dose | var3_Placebo | var3_Xanomeline High Dose | var3_Xanomeline Low Dose | ord_layer_index | ord_layer_1 | ord_layer_2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BUN | Mean (SD) | 4.743 ( 2.0546) | 4.570 ( 1.3015) | 5.712 ( 2.9402) | -0.765 ( 1.6734) | -0.428 ( 1.5910) | -0.428 ( 1.3299) | 5.457 ( 1.5661) | 4.712 ( 1.6382) | 6.497 ( 2.7940) | 1 | 1 | 1 |
CA | Mean (SD) | 2.166 (0.0692) | 2.189 (0.1372) | 2.146 (0.0831) | -0.105 (0.0972) | -0.100 (0.1242) | -0.141 (0.0621) | 2.290 (0.0923) | 2.289 (0.0828) | 2.304 (0.0719) | 1 | 2 | 1 |
CK | Mean (SD) | 72.4 (288.41) | 64.2 ( 93.99) | 58.3 ( 77.91) | -3.9 (275.75) | -15.2 ( 77.56) | -14.7 ( 51.54) | 83.4 ( 38.13) | 84.8 ( 64.27) | 72.3 ( 35.71) | 1 | 3 | 1 |
GGT | Mean (SD) | 17.8 ( 34.77) | 16.8 ( 48.69) | 21.3 ( 26.99) | -1.2 ( 21.67) | -1.5 ( 41.48) | -0.3 ( 20.55) | 24.2 ( 19.36) | 18.8 ( 25.84) | 22.7 ( 11.05) | 1 | 4 | 1 |
URATE | Mean (SD) | 235.937 ( 83.6966) | 271.229 ( 88.1609) | 230.981 ( 87.0065) | -23.792 ( 37.1799) | -28.550 ( 55.6673) | -40.645 ( 24.7959) | 272.617 ( 65.7021) | 310.486 ( 61.8285) | 273.608 ( 86.9470) | 1 | 5 | 1 |
Three variables are being summarized here - AVAL, CHG, and BASE. So which should be used for precision? set_precision_on()
allows you to specify this, where the precision_on()
variable must be one of the variables within target_var
. Similarly, set_precision_by()
changes the by
variables used to determine collected precision. If no precision_on()
variable is specified, the first variable in target_var
is used. If no precision_by()
variables are specified, then the default by
variables are used.