The first step in data exploration usually consists of univariate, descriptive analysis of all variables of interest. Tidycomm offers three basic functions to quickly output relevant statistics:
describe() for continuous variables
describe_cat() for categorical variables
tab_frequencies() for categorical variables
For demonstration purposes, we will use sample data from the Worlds of Journalism 2012-16 study included in tidycomm.
WoJ
#> # A tibble: 1,200 x 15
#> country reach employment temp_contract autonomy_selecti~ autonomy_emphas~
#> <fct> <fct> <chr> <fct> <dbl> <dbl>
#> 1 Germany Nation~ Full-time Permanent 5 4
#> 2 Germany Nation~ Full-time Permanent 3 4
#> 3 Switzerl~ Region~ Full-time Permanent 4 4
#> 4 Switzerl~ Local Part-time Permanent 4 5
#> 5 Austria Nation~ Part-time Permanent 4 4
#> 6 Switzerl~ Local Freelancer <NA> 4 4
#> 7 Germany Local Full-time Permanent 4 4
#> 8 Denmark Nation~ Full-time Permanent 3 3
#> 9 Switzerl~ Local Full-time Permanent 5 5
#> 10 Denmark Nation~ Full-time Permanent 2 4
#> # ... with 1,190 more rows, and 9 more variables: ethics_1 <dbl>,
#> # ethics_2 <dbl>, ethics_3 <dbl>, ethics_4 <dbl>, work_experience <dbl>,
#> # trust_parliament <dbl>, trust_government <dbl>, trust_parties <dbl>,
#> # trust_politicians <dbl>
describe() outputs several measures of central tendency and variability for all variables named in the function call:
WoJ %>%
  describe(autonomy_selection, autonomy_emphasis, work_experience)
#> # A tibble: 3 x 15
#> Variable N Missing M SD Min Q25 Mdn Q75 Max Range
#> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 autonomy_selec~ 1197 3 3.88 0.803 1 4 4 4 5 4
#> 2 autonomy_empha~ 1195 5 4.08 0.793 1 4 4 5 5 4
#> 3 work_experience 1187 13 17.8 10.9 1 8 17 25 53 52
#> # ... with 4 more variables: CI_95_LL <dbl>, CI_95_UL <dbl>, Skewness <dbl>,
#> # Kurtosis <dbl>
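To make the column names above concrete, here is a minimal base R sketch that computes the same kind of statistics by hand for a single variable. The vector `x` is hypothetical toy data, not taken from WoJ, and this is not tidycomm's own implementation; in particular, the quantile definition tidycomm uses may differ from the `quantile()` default assumed here.

```r
# Hypothetical toy vector standing in for a 5-point scale item (not WoJ data)
x <- c(5, 3, 4, 4, NA, 4, 3, 5, 2, 4)

# Drop missing values before computing the statistics
valid <- x[!is.na(x)]

summary_row <- data.frame(
  N       = length(valid),                  # number of valid cases
  Missing = sum(is.na(x)),                  # number of missing cases
  M       = mean(valid),                    # arithmetic mean
  SD      = sd(valid),                      # sample standard deviation
  Min     = min(valid),
  Q25     = unname(quantile(valid, 0.25)),  # default quantile type (assumption)
  Mdn     = median(valid),
  Q75     = unname(quantile(valid, 0.75)),
  Max     = max(valid),
  Range   = max(valid) - min(valid)
)

print(summary_row)
```

The remaining columns (confidence interval bounds, skewness, kurtosis) are left out of the sketch because their exact definitions are not stated in this section.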
If no variables are passed to describe(), all numeric variables in the data are described:
WoJ %>%
  describe()
#> # A tibble: 11 x 15
#> Variable N Missing M SD Min Q25 Mdn Q75 Max Range
#> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 autonomy_sele~ 1197 3 3.88 0.803 1 4 4 4 5 4
#> 2 autonomy_emph~ 1195 5 4.08 0.793 1 4 4 5 5 4
#> 3 ethics_1 1200 0 1.63 0.892 1 1 1 2 5 4
#> 4 ethics_2 1200 0 3.21 1.26 1 2 4 4 5 4
#> 5 ethics_3 1200 0 2.39 1.13 1 2 2 3 5 4
#> 6 ethics_4 1200 0 2.58 1.25 1 1.75 2 4 5 4
#> 7 work_experien~ 1187 13 17.8 10.9 1 8 17 25 53 52
#> 8 trust_parliam~ 1200 0 3.05 0.811 1 3 3 4 5 4
#> 9 trust_governm~ 1200 0 2.82 0.854 1 2 3 3 5 4
#> 10 trust_parties 1200 0 2.42 0.736 1 2 2 3 4 3
#> 11 trust_politic~ 1200 0 2.52 0.712 1 2 3 3 4 3
#> # ... with 4 more variables: CI_95_LL <dbl>, CI_95_UL <dbl>, Skewness <dbl>,
#> # Kurtosis <dbl>
Data can be grouped before describing:
WoJ %>%
  dplyr::group_by(country) %>%
  describe(autonomy_emphasis, autonomy_selection)
#> # A tibble: 10 x 16
#> # Groups: country [5]
#> country Variable N Missing M SD Min Q25 Mdn Q75 Max
#> <fct> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Austria autonomy_e~ 205 2 4.19 0.614 2 4 4 5 5
#> 2 Denmark autonomy_e~ 375 1 3.90 0.856 1 4 4 4 5
#> 3 Germany autonomy_e~ 172 1 4.34 0.818 1 4 5 5 5
#> 4 Switzerl~ autonomy_e~ 233 0 4.07 0.694 1 4 4 4 5
#> 5 UK autonomy_e~ 210 1 4.08 0.838 2 4 4 5 5
#> 6 Austria autonomy_s~ 207 0 3.92 0.637 2 4 4 4 5
#> 7 Denmark autonomy_s~ 376 0 3.76 0.892 1 3 4 4 5
#> 8 Germany autonomy_s~ 172 1 3.97 0.881 1 3 4 5 5
#> 9 Switzerl~ autonomy_s~ 233 0 3.92 0.628 1 4 4 4 5
#> 10 UK autonomy_s~ 209 2 3.91 0.867 1 3 4 5 5
#> # ... with 5 more variables: Range <dbl>, CI_95_LL <dbl>, CI_95_UL <dbl>,
#> # Skewness <dbl>, Kurtosis <dbl>
describe_cat() outputs a short summary (number of unique values, mode, N of mode) for all categorical variables named in the function call:
WoJ %>%
  describe_cat(reach, employment, temp_contract)
#> # A tibble: 3 x 6
#> Variable N Missing Unique Mode Mode_N
#> <chr> <int> <int> <int> <chr> <int>
#> 1 reach 1200 0 4 National 617
#> 2 employment 1200 0 3 Full-time 902
#> 3 temp_contract 1001 199 3 Permanent 948
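As an illustration of what the Mode and Mode_N columns report, the following base R sketch finds the most frequent value of a categorical vector and its count. The vector `employment_toy` is hypothetical and contains no missing values; this is a sketch of the general idea, not tidycomm's implementation, and it does not address how ties or missing values are treated.

```r
# Hypothetical toy data (not taken from WoJ), without missing values
employment_toy <- c("Full-time", "Part-time", "Full-time",
                    "Freelancer", "Full-time", "Part-time")

# Count how often each unique value occurs
counts <- table(employment_toy)

# which.max() returns the position of the first maximum, so ties
# are resolved in favour of the value that sorts first
mode_value <- names(counts)[which.max(counts)]

cat_summary <- data.frame(
  N       = sum(!is.na(employment_toy)),
  Missing = sum(is.na(employment_toy)),
  Unique  = length(counts),
  Mode    = mode_value,
  Mode_N  = as.integer(max(counts))
)

print(cat_summary)
```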
If no variables are passed to describe_cat(), all categorical variables (i.e., character and factor variables) in the data are described:
WoJ %>%
  describe_cat()
#> # A tibble: 4 x 6
#> Variable N Missing Unique Mode Mode_N
#> <chr> <int> <int> <int> <chr> <int>
#> 1 country 1200 0 5 Denmark 376
#> 2 reach 1200 0 4 National 617
#> 3 employment 1200 0 3 Full-time 902
#> 4 temp_contract 1001 199 3 Permanent 948
Data can be grouped before describing:
WoJ %>%
  dplyr::group_by(reach) %>%
  describe_cat(country, employment)
#> # A tibble: 8 x 7
#> # Groups: reach [4]
#> reach Variable N Missing Unique Mode Mode_N
#> <fct> <chr> <int> <int> <int> <chr> <int>
#> 1 Local country 149 0 5 Germany 47
#> 2 Regional country 355 0 5 Switzerland 90
#> 3 National country 617 0 5 Denmark 262
#> 4 Transnational country 79 0 4 UK 72
#> 5 Local employment 149 0 3 Full-time 111
#> 6 Regional employment 355 0 3 Full-time 287
#> 7 National employment 617 0 3 Full-time 438
#> 8 Transnational employment 79 0 3 Full-time 66
tab_frequencies() outputs absolute and relative frequencies of all unique values of one or more categorical variables:
WoJ %>%
  tab_frequencies(employment)
#> # A tibble: 3 x 5
#> employment n percent cum_n cum_percent
#> <chr> <int> <dbl> <int> <dbl>
#> 1 Freelancer 172 0.143 172 0.143
#> 2 Full-time 902 0.752 1074 0.895
#> 3 Part-time 126 0.105 1200 1
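The relationship between the columns can be retraced by hand. The base R sketch below starts from the three absolute frequencies reported in the output above and derives the relative and cumulative columns from them. The data frame `freq` is built for illustration only; this is not how tidycomm computes the table internally.

```r
# Absolute frequencies copied from the output above
freq <- data.frame(
  employment = c("Freelancer", "Full-time", "Part-time"),
  n          = c(172L, 902L, 126L)
)

# Relative frequency: each count divided by the total number of cases
freq$percent <- freq$n / sum(freq$n)

# Cumulative columns: running totals in row order
freq$cum_n       <- cumsum(freq$n)
freq$cum_percent <- cumsum(freq$percent)

print(freq)
```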
Passing more than one variable will compute relative frequencies based on all combinations of unique values:
WoJ %>%
  tab_frequencies(employment, country)
#> # A tibble: 15 x 6
#> employment country n percent cum_n cum_percent
#> <chr> <fct> <int> <dbl> <int> <dbl>
#> 1 Freelancer Austria 16 0.0133 16 0.0133
#> 2 Freelancer Denmark 85 0.0708 101 0.0842
#> 3 Freelancer Germany 29 0.0242 130 0.108
#> 4 Freelancer Switzerland 10 0.00833 140 0.117
#> 5 Freelancer UK 32 0.0267 172 0.143
#> 6 Full-time Austria 165 0.138 337 0.281
#> 7 Full-time Denmark 275 0.229 612 0.51
#> 8 Full-time Germany 139 0.116 751 0.626
#> 9 Full-time Switzerland 154 0.128 905 0.754
#> 10 Full-time UK 169 0.141 1074 0.895
#> 11 Part-time Austria 26 0.0217 1100 0.917
#> 12 Part-time Denmark 16 0.0133 1116 0.93
#> 13 Part-time Germany 5 0.00417 1121 0.934
#> 14 Part-time Switzerland 69 0.0575 1190 0.992
#> 15 Part-time UK 10 0.00833 1200 1
You can also group your data beforehand. This yields relative frequencies computed within each group:
WoJ %>%
  dplyr::group_by(country) %>%
  tab_frequencies(employment)
#> # A tibble: 15 x 6
#> # Groups: country [5]
#> employment country n percent cum_n cum_percent
#> <chr> <fct> <int> <dbl> <int> <dbl>
#> 1 Freelancer Austria 16 0.0773 16 0.0773
#> 2 Full-time Austria 165 0.797 181 0.874
#> 3 Part-time Austria 26 0.126 207 1
#> 4 Freelancer Denmark 85 0.226 85 0.226
#> 5 Full-time Denmark 275 0.731 360 0.957
#> 6 Part-time Denmark 16 0.0426 376 1
#> 7 Freelancer Germany 29 0.168 29 0.168
#> 8 Full-time Germany 139 0.803 168 0.971
#> 9 Part-time Germany 5 0.0289 173 1
#> 10 Freelancer Switzerland 10 0.0429 10 0.0429
#> 11 Full-time Switzerland 154 0.661 164 0.704
#> 12 Part-time Switzerland 69 0.296 233 1
#> 13 Freelancer UK 32 0.152 32 0.152
#> 14 Full-time UK 169 0.801 201 0.953
#> 15 Part-time UK 10 0.0474 211 1
(Compare the columns percent, cum_n, and cum_percent with the output above.)
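The difference between overall and within-group relative frequencies can also be seen in a small base R sketch. The data frame `toy` and its values are hypothetical, not taken from WoJ, and `prop.table()` is used here only to illustrate the two denominators; it is not a claim about tidycomm's internals.

```r
# Hypothetical toy data with two countries (not WoJ)
toy <- data.frame(
  country    = c("A", "A", "A", "B", "B"),
  employment = c("Full-time", "Part-time", "Full-time",
                 "Full-time", "Freelancer")
)

# Cross-tabulate employment (rows) by country (columns)
counts <- table(toy$employment, toy$country)

# Overall: every cell divided by the grand total (all cells sum to 1)
overall <- prop.table(counts)

# Within-group: every cell divided by its country total
# (each column sums to 1)
within <- prop.table(counts, margin = 2)

print(overall)
print(within)
```

In the toy data, full-time journalists in country A make up 2 of 5 cases overall but 2 of 3 cases within country A, which is the same shift visible in the percent column of the grouped output.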