library(simpr)
simpr
is designed with reproducibility in mind. If you set the same seed, you get the same results.
set.seed(500)
= specify(a = ~ runif(6)) %>%
run_1 generate(3)
run_1#> full tibble
#> --------------------------
#> # A tibble: 3 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 1]>
#> 2 2 2 <tibble [6 × 1]>
#> 3 3 3 <tibble [6 × 1]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 1
#> a
#> <dbl>
#> 1 0.574
#> 2 0.135
#> 3 0.660
#> 4 0.0854
#> 5 0.308
#> 6 0.00362
set.seed(500)
= specify(a = ~ runif(6)) %>%
run_2 generate(3)
run_2#> full tibble
#> --------------------------
#> # A tibble: 3 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 1]>
#> 2 2 2 <tibble [6 × 1]>
#> 3 3 3 <tibble [6 × 1]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 1
#> a
#> <dbl>
#> 1 0.574
#> 2 0.135
#> 3 0.660
#> 4 0.0854
#> 5 0.308
#> 6 0.00362
identical(run_1, run_2)
#> [1] TRUE
What’s more, generate()
can take filtering criteria, so that you can re-generate specific repetitions or conditions without having to recreate the entire simulation. This requires that the seed, specification, definition, and number of reps is identical to the simulation you are trying to reproduce.
set.seed(500)
= specify(a = ~ runif(6)) %>%
filter_after_generating generate(3) %>%
filter(.sim_id == 2)
filter_after_generating#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 2 2 <tibble [6 × 1]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 1
#> a
#> <dbl>
#> 1 0.895
#> 2 0.766
#> 3 0.607
#> 4 0.199
#> 5 0.0969
#> 6 0.247
## Much faster, same result!
set.seed(500)
= specify(a = ~ runif(6)) %>%
filter_while_generating generate(3, .sim_id == 2)
filter_while_generating#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 2 2 <tibble [6 × 1]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 1
#> a
#> <dbl>
#> 1 0.895
#> 2 0.766
#> 3 0.607
#> 4 0.199
#> 5 0.0969
#> 6 0.247
identical(filter_after_generating, filter_while_generating)
#> [1] TRUE
Although only one repetition was generated above, it is the same data as was generated when we actually did the full simulation.
A common use case is for regenerating the data in cases where an error was created. Here’s an example of a simulation that only generated errors in one condition. We generate some data and fit a logistic regression, but notice that we get some errors.
set.seed(500)
= specify(a = ~ sample(0:max, size = 10, replace = TRUE),
fit_tidy b = ~ a + rnorm(10)) %>%
define(max = c(0, 1, 10)) %>%
generate(3) %>%
fit(lm = ~ glm(a ~ b, family = "binomial")) %>%
tidy_fits()
#> Warning in fit.simpr_tibble(., lm = ~glm(a ~ b, family = "binomial")): fit()
#> produced errors. See '.fit_error_*' column(s).
fit_tidy#> # A tibble: 15 × 10
#> .sim_id max rep Source .fit_error term estimate std.error statistic
#> <int> <dbl> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 1 0 1 lm <NA> (Int… -2.46e+ 1 47403. -5.18e- 4
#> 2 1 0 1 lm <NA> b -4.41e-15 34288. -1.29e-19
#> 3 2 1 1 lm <NA> (Int… -6.27e- 1 0.987 -6.35e- 1
#> 4 2 1 1 lm <NA> b 2.25e+ 0 1.56 1.45e+ 0
#> 5 3 10 1 lm "Error in eva… <NA> NA NA NA
#> 6 4 0 2 lm <NA> (Int… -2.46e+ 1 41782. -5.88e- 4
#> 7 4 0 2 lm <NA> b 1.86e-15 42381. 4.39e-20
#> 8 5 1 2 lm <NA> (Int… -1.34e+ 0 0.952 -1.41e+ 0
#> 9 5 1 2 lm <NA> b 7.17e- 1 0.679 1.06e+ 0
#> 10 6 10 2 lm "Error in eva… <NA> NA NA NA
#> 11 7 0 3 lm <NA> (Int… -2.46e+ 1 49990. -4.91e- 4
#> 12 7 0 3 lm <NA> b -5.84e-17 47411. -1.23e-21
#> 13 8 1 3 lm <NA> (Int… -1.72e- 1 1.03 -1.67e- 1
#> 14 8 1 3 lm <NA> b 6.76e- 1 0.971 6.96e- 1
#> 15 9 10 3 lm "Error in eva… <NA> NA NA NA
#> # … with 1 more variable: p.value <dbl>
One options for regenerating is to filter directly to the problematic max == 10
condition to examine the generated data.
set.seed(500)
= specify(a = ~ sample(0:max, size = 10, replace = TRUE),
filter_max_10 b = ~ a + rnorm(10)) %>%
define(max = c(0, 1, 10)) %>%
generate(3, max == 10)
filter_max_10#> full tibble
#> --------------------------
#> # A tibble: 3 × 4
#> .sim_id max rep sim
#> <int> <dbl> <int> <list>
#> 1 3 10 1 <tibble [10 × 2]>
#> 2 6 10 2 <tibble [10 × 2]>
#> 3 9 10 3 <tibble [10 × 2]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 10 × 2
#> a b
#> <int> <dbl>
#> 1 6 6.28
#> 2 9 8.39
#> 3 2 1.67
#> 4 2 3.73
#> 5 1 0.857
#> 6 9 9.09
#> 7 3 2.10
#> 8 9 10.5
#> 9 10 11.4
#> 10 0 -1.73
Looking at the raw generated data, we can see our outcome variable is often larger than 1, which makes no sense for a logistic regression.
In general, we could also filter down to only values of .sim_id
which generated errors to examine those:
= filter(fit_tidy, !is.na(.fit_error))
fit_errors
set.seed(500)
= specify(a = ~ sample(1:max, size = 10, replace = TRUE),
fit_error_data b = ~ a + rnorm(10)) %>%
define(max = c(0, 1, 10)) %>%
generate(3, .sim_id %in% fit_errors$.sim_id)
fit_error_data#> full tibble
#> --------------------------
#> # A tibble: 3 × 4
#> .sim_id max rep sim
#> <int> <dbl> <int> <list>
#> 1 3 10 1 <tibble [10 × 2]>
#> 2 6 10 2 <tibble [10 × 2]>
#> 3 9 10 3 <tibble [10 × 2]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 10 × 2
#> a b
#> <int> <dbl>
#> 1 7 7.56
#> 2 10 7.68
#> 3 3 2.60
#> 4 3 3.52
#> 5 2 -0.137
#> 6 10 9.83
#> 7 4 3.51
#> 8 10 8.73
#> 9 1 0.317
#> 10 8 8.66
This approach is useful in cases where we don’t know which conditions are producing the errors. Sometimes simulation errors arise from numerical issues arising from unlucky draws from the data-generating mechanism, and are not systematic.