Table 1

2022-04-14

Making Table 1

library(furniture)

This vignette demonstrates the main function of the furniture package, table1(). This vignette is current as of furniture 1.9.12.

The main arguments of table1() are below:

table1(.data, ..., splitby, row_wise, test, type, output, format_number, na.rm)

It contains several useful features for summarizing your data:

  1. It summarizes many variables succinctly, providing means/counts and SDs/percentages. By providing variable names to the second argument, you can obtain the median and the interquartile range instead.
  2. The descriptive statistics can be split by a grouping factor (i.e., splitby).
  3. It uses an API similar to the popular tidyverse group of packages and can be used in a pipe.
  4. It can give bivariate test results for each variable against the grouping variable, choosing the test type based on the variable types.
  5. Its output is flexible: it can be printed as regular console output or as latex (either through kable or through a built-in function), markdown, or pandoc (see knitr::kable).
  6. Numbers can be formatted nicely.
  7. The table has multiple formatting options to fit various needs, using the output and type arguments (including the "simple" and "condensed" types).
  8. The table can be exported to a CSV with export = "file_name".

To illustrate, we’ll walk through the main arguments with an example on some fictitious data.

Example

set.seed(84332)
## Create Fictitious Data containing several types of variables
df <- data.frame(a = sample(1:10000, 10000, replace = TRUE),
                 b = runif(10000) + rnorm(10000),
                 c = factor(sample(c(1,2,3,4,NA), 10000, replace=TRUE)),
                 d = factor(sample(c(0,1,NA), 10000, replace=TRUE)),
                 e = trunc(rnorm(10000, 20, 5)),
                 f = factor(sample(c(0,1,NA), 10000, replace=TRUE)))

We will use df to show these main features of table1.

The …

For table1, the ellipsis (the ...) holds the variables to be summarized, which are found in your data. Here, we summarize a through e in df.

table1(df, 
       a, b, c, d, e)
## 
## ────────────────────────
##       Mean/Count (SD/%)
##       n = 5362         
##  a                     
##       4938.4 (2858.0)  
##  b                     
##       0.5 (1.0)        
##  c                     
##     1 1306 (24.4%)     
##     2 1352 (25.2%)     
##     3 1377 (25.7%)     
##     4 1327 (24.7%)     
##  d                     
##     0 2747 (51.2%)     
##     1 2615 (48.8%)     
##  e                     
##       19.5 (5.0)       
## ────────────────────────
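Note that n = 5362 is smaller than the 10,000 rows in df. This is consistent with rows being dropped when any of the listed variables is missing (c and d both contain NA values). The following is a minimal base R sketch of that idea on a small made-up data frame; toy and its columns are hypothetical, and this is not the package's actual implementation.

```r
# Hypothetical toy data: one numeric and two factor columns with some NAs
toy <- data.frame(
  x = c(10, 20, 30, 40, 50, 60),
  g = factor(c("a", "b", NA, "a", "b", "a")),
  h = factor(c(0, 1, 1, NA, 0, 1))
)

# Keep only rows that are complete on every summarized variable
complete <- toy[complete.cases(toy), ]
n <- nrow(complete)

# Mean (SD) for the numeric column
x_summary <- sprintf("%.1f (%.1f)", mean(complete$x), sd(complete$x))

# Count and percentage for each level of a factor column
g_counts <- table(complete$g)
g_pct    <- round(100 * prop.table(g_counts), 1)

print(n)          # 4 rows remain out of 6
print(x_summary)  # "35.0 (23.8)"
print(g_pct)
```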

Splitby

To get means/counts and SDs/percentages by a stratifying variable, use the splitby argument. The splitby can be a quoted variable (e.g., "d") or a one-sided formula as shown below (e.g., ~d).

table1(df,
       a, b, c,
       splitby = ~d)
## 
## ──────────────────────────────────────
##                    d 
##       0               1              
##       n = 2747        n = 2615       
##  a                                   
##       4941.4 (2863.8) 4935.2 (2852.5)
##  b                                   
##       0.5 (1.1)       0.5 (1.0)      
##  c                                   
##     1 648 (23.6%)     658 (25.2%)    
##     2 694 (25.3%)     658 (25.2%)    
##     3 739 (26.9%)     638 (24.4%)    
##     4 666 (24.2%)     661 (25.3%)    
## ──────────────────────────────────────
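For intuition, each numeric cell above is a summary computed within one level of the grouping variable. A rough base R equivalent, using a hypothetical toy data frame (not how table1 is implemented internally), could look like this:

```r
# Hypothetical toy data with a two-level grouping factor
toy <- data.frame(
  x = c(1, 2, 3, 10, 20, 30),
  g = factor(c(0, 0, 0, 1, 1, 1))
)

# Mean and SD of x within each level of g
group_means <- tapply(toy$x, toy$g, mean)
group_sds   <- tapply(toy$x, toy$g, sd)

# One "mean (SD)" cell per group, as in the stratified table
cells <- sprintf("%.1f (%.1f)", group_means, group_sds)
names(cells) <- names(group_means)

print(cells)
```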

Row Wise

You can get percentages by rows instead of by columns (i.e., groups) by using the row_wise = TRUE option.

table1(df,
       a, b, c,
       splitby = ~d,
       row_wise = TRUE)
## 
## ──────────────────────────────────────
##                    d 
##       0               1              
##       n = 2747        n = 2615       
##  a                                   
##       4941.4 (2863.8) 4935.2 (2852.5)
##  b                                   
##       0.5 (1.1)       0.5 (1.0)      
##  c                                   
##     1 648 (49.6%)     658 (50.4%)    
##     2 694 (51.3%)     658 (48.7%)    
##     3 739 (53.7%)     638 (46.3%)    
##     4 666 (50.2%)     661 (49.8%)    
## ──────────────────────────────────────
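The difference between the two kinds of percentages can be checked by hand. The sketch below copies the counts for c from the tables above into a matrix and uses base R's prop.table(); it only illustrates the arithmetic and is not the package's own code.

```r
# Counts of c (rows) by d (columns), copied from the tables above
counts <- matrix(
  c(648, 694, 739, 666,    # d = 0
    658, 658, 638, 661),   # d = 1
  ncol = 2,
  dimnames = list(c = c("1", "2", "3", "4"), d = c("0", "1"))
)

# Default: percentages within each group (each column sums to 100)
col_pct <- round(100 * prop.table(counts, margin = 2), 1)

# row_wise = TRUE: percentages within each level of c (each row sums to 100)
row_pct <- round(100 * prop.table(counts, margin = 1), 1)

print(col_pct)  # first row: 23.6 and 25.2
print(row_pct)  # first row: 49.6 and 50.4
```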

Test

It is easy to test for bivariate relationships, as is common in many Table 1s, using test = TRUE.

table1(df,
       a, b, c,
       splitby = ~d,
       test = TRUE)
## 
## ──────────────────────────────────────────────
##                    d 
##       0               1               P-Value
##       n = 2747        n = 2615               
##  a                                    0.937  
##       4941.4 (2863.8) 4935.2 (2852.5)        
##  b                                    0.241  
##       0.5 (1.1)       0.5 (1.0)              
##  c                                    0.157  
##     1 648 (23.6%)     658 (25.2%)            
##     2 694 (25.3%)     658 (25.2%)            
##     3 739 (26.9%)     638 (24.4%)            
##     4 666 (24.2%)     661 (25.3%)            
## ──────────────────────────────────────────────

By default, only the p-values are shown, but other options exist, such as stars or including the test statistics with the p-values, using the type argument.
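As a sanity check, the p-value for the categorical variable c is consistent with a chi-square test of independence on its counts. The sketch below assumes that is the test being applied here (an assumption based on the output, not on the package source) and re-enters the counts from the table above.

```r
# Counts of c (rows) by d (columns), copied from the table above
counts <- matrix(
  c(648, 694, 739, 666,    # d = 0
    658, 658, 638, 661),   # d = 1
  ncol = 2,
  dimnames = list(c = c("1", "2", "3", "4"), d = c("0", "1"))
)

# Chi-square test of independence between c and d
res <- chisq.test(counts)

print(res$parameter)          # degrees of freedom: (4 - 1) * (2 - 1) = 3
print(round(res$p.value, 3))  # should land close to the value reported for c
```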

Simple and Condensed

The table can be simplified by showing only percentages for categorical variables. It can also be condensed, so that binary variables show the percentage for only one level, and the means and SDs appear on the same line as the variable name.

table1(df,
       f, a, b, c,
       splitby = ~d,
       test = TRUE,
       type = c("simple", "condensed"))
## 
## ──────────────────────────────────────────────
##                    d 
##       0               1               P-Value
##       n = 1801        n = 1720               
##  f: 1 50%             49.7%           0.903  
##  a    4938.7 (2874.3) 4890.0 (2839.7) 0.613  
##  b    0.5 (1.1)       0.5 (1.0)       0.308  
##  c                                    0.016  
##     1 22.5%           25.4%                  
##     2 25%             25.4%                  
##     3 27.9%           23.5%                  
##     4 24.5%           25.7%                  
## ──────────────────────────────────────────────

Medians

If medians and interquartile ranges are desired instead of means and SDs, use the second argument:

table1(df,
       f, a, b, c,
       splitby = ~d,
       test = TRUE,
       type = c("simple", "condensed"),
       second = c("a", "b"))
## 
## ──────────────────────────────────────────────
##                    d 
##       0               1               P-Value
##       n = 1801        n = 1720               
##  f: 1 50%             49.7%           0.903  
##  a    4930.0 [5106.0] 4906.0 [4931.0] 0.613  
##  b    0.5 [1.4]       0.5 [1.4]       0.308  
##  c                                    0.016  
##     1 22.5%           25.4%                  
##     2 25%             25.4%                  
##     3 27.9%           23.5%                  
##     4 24.5%           25.7%                  
## ──────────────────────────────────────────────
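The bracketed values are interquartile ranges, which is why they are so much larger than the SDs for a. To see why the median and IQR can be preferable for skewed data, here is a small base R illustration on a made-up vector (x is hypothetical and not part of df):

```r
# Hypothetical skewed data: one large outlier
x <- c(1, 2, 3, 4, 100)

# Mean (SD) is pulled strongly toward the outlier
mean_sd <- sprintf("%.1f (%.1f)", mean(x), sd(x))

# Median [IQR] describes the bulk of the data
median_iqr <- sprintf("%.1f [%.1f]", median(x), IQR(x))

print(mean_sd)     # "22.0 (43.6)"
print(median_iqr)  # "3.0 [2.0]"
```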

Output Type

Several output types exist for the table (all of the knitr::kable options) including html as shown below. Others include:

  1. “latex”
  2. “markdown”
  3. “pandoc”

table1(df,
       a, b, c,
       splitby = ~d,
       test = TRUE,
       output = "html")
      0               1               P-Value
      n = 2747        n = 2615
 a                                    0.937
      4941.4 (2863.8) 4935.2 (2852.5)
 b                                    0.241
      0.5 (1.1)       0.5 (1.0)
 c                                    0.157
    1 648 (23.6%)     658 (25.2%)
    2 694 (25.3%)     658 (25.2%)
    3 739 (26.9%)     638 (24.4%)
    4 666 (24.2%)     661 (25.3%)

Format Number

For some papers you may want to format the numbers by inserting a comma as a thousands separator in large numbers (e.g., 30,000 vs. 30000). You can do this by using format_number = TRUE.

table1(df,
       a, b, c,
       splitby = ~d,
       test = TRUE,
       format_number = TRUE)
## 
## ──────────────────────────────────────────────────
##                      d 
##       0                 1                 P-Value
##       n = 2747          n = 2615                 
##  a                                        0.937  
##       4,941.4 (2,863.8) 4,935.2 (2,852.5)        
##  b                                        0.241  
##       0.5 (1.1)         0.5 (1.0)                
##  c                                        0.157  
##     1 648 (23.6%)       658 (25.2%)              
##     2 694 (25.3%)       658 (25.2%)              
##     3 739 (26.9%)       638 (24.4%)              
##     4 666 (24.2%)       661 (25.3%)              
## ──────────────────────────────────────────────────
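If you need the same style of number outside of table1, base R's formatC() can add the separator. This is a small stand-alone sketch; whether table1 uses formatC() internally is not something we rely on here.

```r
# Values taken from the table above, plus the 30000 example
values <- c(4941.4, 2863.8, 30000)

# One decimal place with a comma as the thousands separator
formatted <- formatC(values, format = "f", digits = 1, big.mark = ",")

print(formatted)  # "4,941.4" "2,863.8" "30,000.0"
```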

na.rm

To explore the missingness in the factor variables, use na.rm = FALSE, which adds the counts and percentages of the missing values as well.

table1(df,
       a, b, c,
       splitby = ~d,
       test = TRUE,
       na.rm = FALSE)
## 
## ───────────────────────────────────────────────
##                     d 
##        0               1               P-Value
##        n = 3430        n = 3269               
##  a                                     0.479  
##        4918.2 (2861.5) 4967.6 (2863.9)        
##  b                                     0.374  
##        0.5 (1.1)       0.5 (1.0)              
##  c                                     0.157  
##     1  648 (18.9%)     658 (20.1%)            
##     2  694 (20.2%)     658 (20.1%)            
##     3  739 (21.5%)     638 (19.5%)            
##     4  666 (19.4%)     661 (20.2%)            
##     NA 683 (19.9%)     654 (20%)              
## ───────────────────────────────────────────────

Here, the NA row shows that about 20% of the values of c are missing in each group. The group sizes are also larger than before because rows with a missing c are no longer dropped.
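The same idea can be seen in base R, where table() drops missing values unless asked to keep them. The vector x below is made up for illustration:

```r
# Hypothetical factor with two missing values
x <- factor(c(1, 2, NA, 1, NA, 2, 1, 1))

# Default: NA values are silently excluded from the counts
without_na <- table(x)

# Keep NA as its own category, as na.rm = FALSE does in the table above
with_na <- table(x, useNA = "ifany")
pct     <- round(100 * prop.table(with_na), 1)

print(without_na)  # counts for levels 1 and 2 only
print(with_na)     # adds an <NA> column with a count of 2
print(pct)         # 50, 25, 25
```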

Piping

Finally, and very importantly, to make it easier to use with the tidyverse packages, a piping option is available. table1 can take a grouped_df object output by dplyr::group_by() and use the groups indicated there, as shown below.

library(dplyr)

df %>%
  filter(f == 1) %>%
  group_by(d) %>%
  table1(a, b, c,
         test = TRUE,
         type = c("simple", "condensed"))
## 
## ──────────────────────────────────────────────
##                    d 
##       0               1               P-Value
##       n = 900         n = 855                
##  a    4971.3 (2861.1) 4820.0 (2849.9) 0.267  
##  b    0.6 (1.0)       0.5 (1.0)       0.528  
##  c                                    0.149  
##     1 22.6%           24.9%                  
##     2 25.3%           24.8%                  
##     3 27.2%           22.9%                  
##     4 24.9%           27.4%                  
## ──────────────────────────────────────────────

This includes the ability to use multiple grouping variables. Each column label is the value of the first grouping variable, then a hyphen, followed by the value of the second grouping variable.

df %>%
  group_by(d, f) %>%
  table1(a, b, c,
         test = TRUE,
         type = c("simple", "condensed"))
## 
## ──────────────────────────────────────────────────────────────────────────────
##                                   d, f 
##       0-0             1-0             0-1             1-1             P-Value
##       n = 901         n = 865         n = 900         n = 855                
##  a    4906.2 (2888.6) 4959.2 (2829.6) 4971.3 (2861.1) 4820.0 (2849.9) 0.68   
##  b    0.5 (1.1)       0.5 (1.0)       0.6 (1.0)       0.5 (1.0)       0.629  
##  c                                                                    0.145  
##     1 22.4%           25.9%           22.6%           24.9%                  
##     2 24.8%           26%             25.3%           24.8%                  
##     3 28.6%           24%             27.2%           22.9%                  
##     4 24.2%           24%             24.9%           27.4%                  
## ──────────────────────────────────────────────────────────────────────────────
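The column labels and their order (0-0, 1-0, 0-1, 1-1) match what base R's interaction() produces when the first factor varies fastest. The sketch below reproduces the labels on made-up factors; that table1 builds its labels this way is an assumption, not something confirmed from the package source.

```r
# Hypothetical grouping factors covering all four combinations
d <- factor(c(0, 1, 0, 1))
f <- factor(c(0, 0, 1, 1))

# Combine the two factors into one, joining the values with a hyphen
groups <- interaction(d, f, sep = "-")

print(levels(groups))  # "0-0" "1-0" "0-1" "1-1"
```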

Variable Names

You can also adjust the variable names from within the function like so:

table1(df,
       "Avar" = a, "Bvar" = b, "Cvar" = c,
       splitby = ~d,
       test = TRUE)
## 
## ──────────────────────────────────────────────
##                    d 
##       0               1               P-Value
##       n = 2747        n = 2615               
##  Avar                                 0.937  
##       4941.4 (2863.8) 4935.2 (2852.5)        
##  Bvar                                 0.241  
##       0.5 (1.1)       0.5 (1.0)              
##  Cvar                                 0.157  
##     1 648 (23.6%)     658 (25.2%)            
##     2 694 (25.3%)     658 (25.2%)            
##     3 739 (26.9%)     638 (24.4%)            
##     4 666 (24.2%)     661 (25.3%)            
## ──────────────────────────────────────────────

This is particularly useful when you adjust a variable within the function:

df %>% 
  group_by(d) %>% 
  table1("A" = factor(ifelse(a > 500, 1, 0)), b, c,
         test = TRUE)
## 
## ────────────────────────────────────────
##                 d 
##       0            1            P-Value
##       n = 2747     n = 2615            
##  A                              0.507  
##     0 130 (4.7%)   135 (5.2%)          
##     1 2617 (95.3%) 2480 (94.8%)        
##  b                              0.241  
##       0.5 (1.1)    0.5 (1.0)           
##  c                              0.157  
##     1 648 (23.6%)  658 (25.2%)         
##     2 694 (25.3%)  658 (25.2%)         
##     3 739 (26.9%)  638 (24.4%)         
##     4 666 (24.2%)  661 (25.3%)         
## ────────────────────────────────────────

Here we changed a to a factor within the function. So that the name looks better, we assign a new name; otherwise it would be named something like factor.ifelse.a....
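The awkward default name comes from turning the expression itself into a syntactically valid column name. Base R's make.names() shows the effect; we assume here that the automatic name is produced in a similar way, which the package source would need to confirm.

```r
# The expression used in the call above, as text
expr_text <- "factor(ifelse(a > 500, 1, 0))"

# Characters that are not allowed in a name are replaced with "."
auto_name <- make.names(expr_text)

print(auto_name)  # begins with "factor.ifelse.a..."
```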

Final Note

As a final note, the "table1" object can be coerced to a data.frame very easily:

tab1 <- table1(df,
               a, b, c,
               splitby = ~d,
               test = TRUE)
as.data.frame(tab1)
##       .               0               1 P.Value
## 1              n = 2747        n = 2615        
## 2     a                                   0.937
## 3       4941.4 (2863.8) 4935.2 (2852.5)        
## 4     b                                   0.241
## 5             0.5 (1.1)       0.5 (1.0)        
## 6     c                                   0.157
## 7     1     648 (23.6%)     658 (25.2%)        
## 8     2     694 (25.3%)     658 (25.2%)        
## 9     3     739 (26.9%)     638 (24.4%)        
## 10    4     666 (24.2%)     661 (25.3%)

Conclusions

table1 can be a valuable addition to the tools used to produce descriptive statistics. Enjoy this valuable piece of furniture!