The package includes functions for power analysis and sample size calculation for Welch and Hsu (Hedderich and Sachs 2018) t tests including Monte-Carlo simulations of empirical power and type-I-error. In addition, power and sample size calculation for Wilcoxon rank sum and signed rank tests via Monte-Carlo simulations can be performed. Moreover, power and sample size required for the evaluation of a diagnostic test(-system) (Flahault, Cadilhac, and Thomas 2005; Dobbin and Simon 2007) as well as for a single proportion (Fleiss, Levin, and Paik 2003) and comparing two negative binomial rates (Zhu and Lakkis 2014) can be calculated.
We first load the package.
library(MKpower)
The computation of the required sample size for investigating a single proportion can be determined via the respective test or confidence interval (Fleiss, Levin, and Paik 2003). First, we consider the asymptotic test.
## with continuity correction
power.prop1.test(p1 = 0.4, power = 0.8)
##
## Power calculation for testing a given proportion (with continuity correction)
##
## n = 203.7246
## delta = 0.1
## p1 = 0.4
## p0 = 0.5
## sig.level = 0.05
## exact.sig.level = 0.04205139
## power = 0.8
## exact.power = 0.8008049
## alternative = two.sided
##
## NOTE: n = total sample size
## without continuity correction
power.prop1.test(p1 = 0.4, power = 0.8, cont.corr = FALSE)
##
## Power calculation for testing a given proportion
##
## n = 193.8473
## delta = 0.1
## p1 = 0.4
## p0 = 0.5
## sig.level = 0.05
## exact.sig.level = 0.05228312
## power = 0.8
## exact.power = 0.8067065
## alternative = two.sided
##
## NOTE: n = total sample size
Next, we compute the sample size via the respective asymptotic confidence interval.
## with continuity correction
ssize.propCI(prop = 0.4, width = 0.14)
##
## Sample size calculation by method of wald-cc
##
## n = 202.1865
## prop = 0.4
## width = 0.14
## conf.level = 0.95
##
## NOTE: Two-sided confidence interval
## without continuity correction
ssize.propCI(prop = 0.4, width = 0.14, method = "wald")
##
## Sample size calculation by method of wald
##
## n = 188.1531
## prop = 0.4
## width = 0.14
## conf.level = 0.95
##
## NOTE: Two-sided confidence interval
For computing the sample size of the Welch t-test, we only consider the situation of equal group size (balanced design).
## identical results as power.t.test, since sd = sd1 = sd2 = 1
power.welch.t.test(n = 20, delta = 1)
##
## Two-sample Welch t test power calculation
##
## n = 20
## delta = 1
## sd1 = 1
## sd2 = 1
## sig.level = 0.05
## power = 0.8689528
## alternative = two.sided
##
## NOTE: n is number in *each* group
power.welch.t.test(power = .90, delta = 1)
##
## Two-sample Welch t test power calculation
##
## n = 22.0211
## delta = 1
## sd1 = 1
## sd2 = 1
## sig.level = 0.05
## power = 0.9
## alternative = two.sided
##
## NOTE: n is number in *each* group
power.welch.t.test(power = .90, delta = 1, alternative = "one.sided")
##
## Two-sample Welch t test power calculation
##
## n = 17.84713
## delta = 1
## sd1 = 1
## sd2 = 1
## sig.level = 0.05
## power = 0.9
## alternative = one.sided
##
## NOTE: n is number in *each* group
## sd1 = 0.5, sd2 = 1
power.welch.t.test(delta = 1, sd1 = 0.5, sd2 = 1, power = 0.9)
##
## Two-sample Welch t test power calculation
##
## n = 14.53583
## delta = 1
## sd1 = 0.5
## sd2 = 1
## sig.level = 0.05
## power = 0.9
## alternative = two.sided
##
## NOTE: n is number in *each* group
For computing the sample size of the Hsu t-test (Hedderich and Sachs 2018), we only consider the situation of equal group size (balanced design).
## slightly more conservative than Welch t-test
power.hsu.t.test(n = 20, delta = 1)
##
## Two-sample Hsu t test power calculation
##
## n = 20
## delta = 1
## sd1 = 1
## sd2 = 1
## sig.level = 0.05
## power = 0.8506046
## alternative = two.sided
##
## NOTE: n is number in *each* group
power.hsu.t.test(power = .90, delta = 1)
##
## Two-sample Hsu t test power calculation
##
## n = 23.02186
## delta = 1
## sd1 = 1
## sd2 = 1
## sig.level = 0.05
## power = 0.9
## alternative = two.sided
##
## NOTE: n is number in *each* group
power.hsu.t.test(power = .90, delta = 1, alternative = "one.sided")
##
## Two-sample Hsu t test power calculation
##
## n = 18.5674
## delta = 1
## sd1 = 1
## sd2 = 1
## sig.level = 0.05
## power = 0.9
## alternative = one.sided
##
## NOTE: n is number in *each* group
## sd1 = 0.5, sd2 = 1
power.welch.t.test(delta = 0.5, sd1 = 0.5, sd2 = 1, power = 0.9)
##
## Two-sample Welch t test power calculation
##
## n = 53.86822
## delta = 0.5
## sd1 = 0.5
## sd2 = 1
## sig.level = 0.05
## power = 0.9
## alternative = two.sided
##
## NOTE: n is number in *each* group
power.hsu.t.test(delta = 0.5, sd1 = 0.5, sd2 = 1, power = 0.9)
##
## Two-sample Hsu t test power calculation
##
## n = 54.49421
## delta = 0.5
## sd1 = 0.5
## sd2 = 1
## sig.level = 0.05
## power = 0.9
## alternative = two.sided
##
## NOTE: n is number in *each* group
For computing the sample size of these tests, we offer a function that performs Monte-Carlo simulations to determine their (empirical) power.
rx <- function(n) rnorm(n, mean = 0, sd = 1)
ry <- function(n) rnorm(n, mean = 0.5, sd = 1)
## two-sample
sim.ssize.wilcox.test(rx = rx, ry = ry, n.max = 100, iter = 1000)
##
## Wilcoxon rank sum test
##
## n = 10, 20, 30, 40, 50, 60, 70
## rx = rnorm(n, mean = 0, sd = 1)
## ry = rnorm(n, mean = 0.5, sd = 1)
## sig.level = 0.05
## emp.power = 0.163, 0.325, 0.454, 0.596, 0.658, 0.756, 0.814
## alternative = two.sided
##
## NOTE: n is number in *each* group
sim.ssize.wilcox.test(rx = rx, ry = ry, n.min = 65, n.max = 70,
step.size = 1, iter = 1000, BREAK = FALSE)
##
## Wilcoxon rank sum test
##
## n = 65, 66, 67, 68, 69, 70
## rx = rnorm(n, mean = 0, sd = 1)
## ry = rnorm(n, mean = 0.5, sd = 1)
## sig.level = 0.05
## emp.power = 0.791, 0.798, 0.791, 0.779, 0.811, 0.840
## alternative = two.sided
##
## NOTE: n is number in *each* group
power.t.test(delta = 0.5, power = 0.8)
##
## Two-sample t test power calculation
##
## n = 63.76576
## delta = 0.5
## sd = 1
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
## one-sample
sim.ssize.wilcox.test(rx = ry, n.max = 100, iter = 1000, type = "one.sample")
##
## Wilcoxon signed rank test
##
## n = 10, 20, 30, 40
## rx = rnorm(n, mean = 0.5, sd = 1)
## sig.level = 0.05
## emp.power = 0.272, 0.550, 0.729, 0.851
## alternative = two.sided
##
## NOTE: n is number in *each* group
sim.ssize.wilcox.test(rx = ry, type = "one.sample", n.min = 33,
n.max = 38, step.size = 1, iter = 1000, BREAK = FALSE)
##
## Wilcoxon signed rank test
##
## n = 33, 34, 35, 36, 37, 38
## rx = rnorm(n, mean = 0.5, sd = 1)
## sig.level = 0.05
## emp.power = 0.776, 0.766, 0.798, 0.820, 0.812, 0.814
## alternative = two.sided
##
## NOTE: n is number in *each* group
power.t.test(delta = 0.5, power = 0.8, type = "one.sample")
##
## One-sample t test power calculation
##
## n = 33.3672
## delta = 0.5
## sd = 1
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
## two-sample
rx <- function(n) rgamma(n, scale = 2, shape = 1.5)
ry <- function(n) rgamma(n, scale = 4, shape = 1.5) # equal shape
## two-sample
sim.ssize.wilcox.test(rx = rx, ry = ry, n.max = 100, iter = 1000)
##
## Wilcoxon rank sum test
##
## n = 10, 20, 30
## rx = rgamma(n, scale = 2, shape = 1.5)
## ry = rgamma(n, scale = 4, shape = 1.5)
## sig.level = 0.05
## emp.power = 0.323, 0.632, 0.825
## alternative = two.sided
##
## NOTE: n is number in *each* group
sim.ssize.wilcox.test(rx = rx, ry = ry, n.min = 25, n.max = 30,
step.size = 1, iter = 1000, BREAK = FALSE)
##
## Wilcoxon rank sum test
##
## n = 25, 26, 27, 28, 29, 30
## rx = rgamma(n, scale = 2, shape = 1.5)
## ry = rgamma(n, scale = 4, shape = 1.5)
## sig.level = 0.05
## emp.power = 0.736, 0.747, 0.757, 0.797, 0.790, 0.807
## alternative = two.sided
##
## NOTE: n is number in *each* group
For pratical applications we recommend to use a higher number of iterations. For more detailed examples we refer to the help page of the function.
When we consider two negative binomial rates (Zhu and Lakkis 2014), we can compute sample size or power applying function power.nb.test.
## examples from Table III in Zhu and Lakkis (2014)
power.nb.test(mu0 = 5.0, RR = 2.0, theta = 1/0.5, duration = 1, power = 0.8, approach = 1)
##
## Power calculation for comparing two negative binomial rates
##
## n = 22.37386
## n1 = 22.37386
## mu0 = 5
## RR = 2
## theta = 2
## duration = 1
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n = sample size of control group, n1 = sample size of treatment group
power.nb.test(mu0 = 5.0, RR = 2.0, theta = 1/0.5, duration = 1, power = 0.8, approach = 2)
##
## Power calculation for comparing two negative binomial rates
##
## n = 21.23734
## n1 = 21.23734
## mu0 = 5
## RR = 2
## theta = 2
## duration = 1
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n = sample size of control group, n1 = sample size of treatment group
power.nb.test(mu0 = 5.0, RR = 2.0, theta = 1/0.5, duration = 1, power = 0.8, approach = 3)
##
## Power calculation for comparing two negative binomial rates
##
## n = 20.85564
## n1 = 20.85564
## mu0 = 5
## RR = 2
## theta = 2
## duration = 1
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n = sample size of control group, n1 = sample size of treatment group
Given an expected sensitivity and specificity we can compute sample size, power, delta or significance level of diagnostic test (Flahault, Cadilhac, and Thomas 2005).
## see n2 on page 1202 of Chu and Cole (2007)
power.diagnostic.test(sens = 0.99, delta = 0.14, power = 0.95) # 40
##
## Diagnostic test exact power calculation
##
## sens = 0.99
## n = 40
## n1 = 40
## delta = 0.14
## sig.level = 0.05
## power = 0.95
## prev = NULL
##
## NOTE: n is number of cases, n1 is number of controls
power.diagnostic.test(sens = 0.99, delta = 0.13, power = 0.95) # 43
##
## Diagnostic test exact power calculation
##
## sens = 0.99
## n = 43
## n1 = 43
## delta = 0.13
## sig.level = 0.05
## power = 0.95
## prev = NULL
##
## NOTE: n is number of cases, n1 is number of controls
power.diagnostic.test(sens = 0.99, delta = 0.12, power = 0.95) # 47
##
## Diagnostic test exact power calculation
##
## sens = 0.99
## n = 47
## n1 = 47
## delta = 0.12
## sig.level = 0.05
## power = 0.95
## prev = NULL
##
## NOTE: n is number of cases, n1 is number of controls
The sample size planning for developing binary classifiers in case of high dimensional data, we can apply function ssize.pcc, which is based on the probability of correct classification (PCC) (Dobbin and Simon 2007).
## see Table 2 of Dobbin et al. (2008)
g <- 0.1
fc <- 1.6
ssize.pcc(gamma = g, stdFC = fc, nrFeatures = 22000)
##
## Sample Size Planning for Developing Classifiers Using High Dimensional Data
##
## gamma = 0.1
## prev = 0.5
## nrFeatures = 22000
## n1 = 21
## n2 = 21
##
## NOTE: n1 is number of cases, n2 is number of controls
There are quite some discussions and various proposals, which test is the best under which circumstances when we want to compare the location (mean, median) of two groups (Fagerland and Sandvik 2009; Fagerland 2012; Sezer, Ozkip, and Yazici 2017). In addition, it is questioned whether pre-testing of assumptions is appropriate/useful at least from a practical point of view (Zimmerman 2004; Rasch, Kubinger, and Moder 2011; Rochon, Gondan, and Kieser 2012; Hoekstra, Kiers, and Johnson 2012).
We provide functions to simulate power and type-I-error of classical (Gosset 1908), Welch (Welch 1947) and Hsu (Hsu 1938) t-tests as well as of Wilcoxon-Mann-Whitney tests (Wilcoxon 1945; Mann and Whitney 1947).
## functions to simulate the data
## group 1
rx <- function(n) rnorm(n, mean = 0, sd = 1)
rx.H0 <- rx
## group 2
ry <- function(n) rnorm(n, mean = 3, sd = 3)
ry.H0 <- function(n) rnorm(n, mean = 0, sd = 3)
## theoretical results
power.welch.t.test(n = 10, delta = 3, sd1 = 1, sd2 = 3)
##
## Two-sample Welch t test power calculation
##
## n = 10
## delta = 3
## sd1 = 1
## sd2 = 3
## sig.level = 0.05
## power = 0.7794139
## alternative = two.sided
##
## NOTE: n is number in *each* group
power.hsu.t.test(n = 10, delta = 3, sd1 = 1, sd2 = 3)
##
## Two-sample Hsu t test power calculation
##
## n = 10
## delta = 3
## sd1 = 1
## sd2 = 3
## sig.level = 0.05
## power = 0.7611553
## alternative = two.sided
##
## NOTE: n is number in *each* group
## simulation
res.t <- sim.power.t.test(nx = 10, rx = rx, rx.H0 = rx.H0,
ny = 10, ry = ry, ry.H0 = ry.H0)
res.t
##
## Simulation Set-up
## nx = 10
## rx = function (n) , rnorm(n, mean = 0, sd = 1)
## rx.H0 = function (n) , rnorm(n, mean = 0, sd = 1)
## ny = 10
## ry = function (n) , rnorm(n, mean = 3, sd = 3)
## ry.H0 = function (n) , rnorm(n, mean = 0, sd = 3)
## sig.level = 0.05
## mu = 0
## alternative = two.sided
## iter = 10000
##
## Classical Two-sample t-Test
## emp.power = 0.8082
## emp.type.I.error = 0.0548
##
## Welch Two-sample t-Test
## emp.power = 0.7789
## emp.type.I.error = 0.0476
##
## Hsu Two-sample t-Test
## emp.power = 0.7622
## emp.type.I.error = 0.0405
res.w <- sim.power.wilcox.test(nx = 10, rx = rx, rx.H0 = rx.H0,
ny = 10, ry = ry, ry.H0 = ry.H0)
res.w
##
## Simulation Set-up
## nx = 10
## rx = function (n) , rnorm(n, mean = 0, sd = 1)
## rx.H0 = function (n) , rnorm(n, mean = 0, sd = 1)
## ny = 10
## ry = function (n) , rnorm(n, mean = 3, sd = 3)
## ry.H0 = function (n) , rnorm(n, mean = 0, sd = 3)
## sig.level = 0.05
## mu = 0
## alternative = two.sided
## iter = 10000
## conf.int = FALSE
## approximate = FALSE
## ties = FALSE
##
## Exact Wilcoxon-Mann-Whitney Test
## emp.power = 0.7463
## emp.type.I.error = 0.0587
##
## Asymptotic Wilcoxon-Mann-Whitney Test
## emp.power = 0.7463
## emp.type.I.error = 0.0587
For further investigation of the results, we provide some diagnostic plots.
## t-tests
hist(res.t)
qqunif(res.t, alpha = 0.05)
volcano(res.t, hex = TRUE)
## Wilcoxon-Mann-Whitney tests
hist(res.w)
qqunif(res.w, alpha = 0.05)
We can also generate a volcano plot for the Wilcoxon-Mann-Whitney test, but this would require setting argument conf.int to TRUE, which would lead to a much higher computation time, hence we omitted it here. Furthermore, it is also possible to compute an approximate version of the test by setting argument approximate to TRUE (Hothorn et al. 2008) again by the cost of an increased computation time.
sessionInfo()
## R version 4.0.3 Patched (2020-10-12 r79333)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Linux Mint 19.3
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/libf77blas.so.3.10.3
## LAPACK: /home/kohlm/RTOP/Rbranch/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
## [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] MKpower_0.5
##
## loaded via a namespace (and not attached):
## [1] zoo_1.8-8 matrixTests_0.1.9 modeltools_0.2-23 tidyselect_1.1.0
## [5] xfun_0.19 coin_1.3-1 purrr_0.3.4 splines_4.0.3
## [9] lattice_0.20-41 colorspace_2.0-0 vctrs_0.3.5 generics_0.1.0
## [13] htmltools_0.5.0 stats4_4.0.3 yaml_2.2.1 gmp_0.6-1
## [17] survival_3.2-7 rlang_0.4.8 hexbin_1.28.1 pillar_1.4.6
## [21] glue_1.4.2 arrangements_1.1.9 MKinfer_0.6 matrixStats_0.57.0
## [25] multcomp_1.4-15 lifecycle_0.2.0 robustbase_0.93-6 stringr_1.4.0
## [29] munsell_0.5.0 gtable_0.3.0 mvtnorm_1.1-1 codetools_0.2-18
## [33] evaluate_0.14 labeling_0.4.2 knitr_1.30 parallel_4.0.3
## [37] DEoptimR_1.0-8 TH.data_1.0-10 scales_1.1.1 farver_2.0.3
## [41] ggplot2_3.3.2 digest_0.6.27 stringi_1.5.3 dplyr_1.0.2
## [45] qqplotr_0.0.4 grid_4.0.3 tools_4.0.3 sandwich_3.0-0
## [49] magrittr_2.0.1 tibble_3.0.4 MKdescr_0.6 crayon_1.3.4
## [53] pkgconfig_2.0.3 ellipsis_0.3.1 MASS_7.3-53 libcoin_1.0-6
## [57] Matrix_1.2-18 rmarkdown_2.5 R6_2.5.0 boot_1.3-25
## [61] nlme_3.1-150 compiler_4.0.3
Dobbin, K.K., and R.M. Simon. 2007. “Sample size planning for developing classifiers using high-dimensional DNA microarray data.” Biostatistics 8 (1): 101–17.
Fagerland, M. W. 2012. “t-tests, non-parametric tests, and large studies–a paradox of statistical practice?” BMC Med Res Methodol 12 (June): 78.
Fagerland, M. W., and L. Sandvik. 2009. “Performance of five two-sample location tests for skewed distributions with unequal variances.” Contemp Clin Trials 30 (5): 490–96.
Flahault, A., M. Cadilhac, and G. Thomas. 2005. “Sample size calculation should be performed for design accuracy in diagnostic test studies.” J Clin Epidemiol 58 (8): 859–62.
Fleiss, Joseph L., Bruce Levin, and Myunghee Cho Paik. 2003. Statistical Methods for Rates and Proportions. Hoboken, N.J: Wiley Series in Probability; Statistics.
Gosset, William Sealy. 1908. “The Probable Error of a Mean.” Biometrika 6 (1): 1–25.
Hedderich, Jürgen, and Lothar Sachs. 2018. Angewandte Statistik: Methodensammlung Mit R. Springer Berlin Heidelberg. doi:10.1007/978-3-662-56657-2.
Hoekstra, R., H. A. Kiers, and A. Johnson. 2012. “Are assumptions of well-known statistical techniques checked, and why (not)?” Front Psychol 3: 137.
Hothorn, Torsten, Kurt Hornik, Mark A. van de Wiel, and Achim Zeileis. 2008. “Implementing a Class of Permutation Tests: The coin Package.” Journal of Statistical Software 28 (8): 1–23. doi:10.18637/jss.v028.i08.
Hsu, P. 1938. “Contribution to the Theory of ‘Student’s’ T-Test as Applied to the Problem of Two Samples.” Statistical Research Memoirs 2: 1–24.
Mann, H. B., and D. R. Whitney. 1947. “On a Test of Whether One of Two Random Variables Is Stochastically Larger Than the Other.” Ann. Math. Statist. 18 (1). The Institute of Mathematical Statistics: 50–60. doi:10.1214/aoms/1177730491.
Rasch, Dieter, Klaus D. Kubinger, and Karl Moder. 2011. “The Two-Sample T Test: Pre-Testing Its Assumptions Does Not Pay Off.” Statistical Papers 52: 219–31.
Rochon, J., M. Gondan, and M. Kieser. 2012. “To test or not to test: Preliminary assessment of normality when comparing two independent samples.” BMC Med Res Methodol 12 (June): 81.
Sezer, Ahmet, Evren Ozkip, and Berna Yazici. 2017. “Comparison of Confidence Intervals for the Behrens-Fisher Problem.” Communications in Statistics - Simulation and Computation 46 (4). Taylor & Francis: 3242–66. doi:10.1080/03610918.2015.1082587.
Welch, B. L. 1947. “The Generalisation of Student’s Problems When Several Different Population Variances Are Involved.” Biometrika 34 (1-2): 28–35.
Wilcoxon, Frank. 1945. “Individual Comparisons by Ranking Methods.” Biometrics Bulletin 1 (6). [International Biometric Society, Wiley]: 80–83.
Zhu, H., and H. Lakkis. 2014. “Sample size calculation for comparing two negative binomial rates.” Stat Med 33 (3): 376–87.
Zimmerman, D. W. 2004. “A note on preliminary tests of equality of variances.” Br J Math Stat Psychol 57 (Pt 1): 173–81.