Using future code in a package is not much different from using other types of package code, or from using futures in R scripts. However, there are a few things that are useful to know in order to minimize the risk of surprises for the end user.
By far the most common and popular future strategy is to parallelize on the local machine, i.e. plan(multisession). This is good enough in most situations, but note that some end users have access to multiple machines and might want to run your code on all of them to speed it up beyond what a single machine can do. Because of this, avoid, as far as possible, making the assumption that your code will only run on the local machine. A good “smell test” is to ask yourself:
- Will my future code work if it ends up running on the other side of the world?
Regardless of performance, if you answer “Yes”, you have already embraced the core philosophy of the future framework. If you answer “Maybe” or “No”, see if you can rewrite it.
For instance, if your future code makes the assumption that it has access to your local file system, as in:
f <- future({
data <- read_tsv(file)
analyze(data)
})
you can rewrite the code to load the content of the file before you set up the future, as in:
data <- read_tsv(file)
f <- future({
analyze(data)
})
Similarly, avoid having the future code write to the worker's local file system, because the parent R session might not have access to that file system.
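As a minimal, self-contained sketch of this pattern, with a placeholder analyze() function standing in for real analysis code, return the result from the future and do any file writing in the parent R session:

```r
library(future)
plan(multisession, workers = 2)

analyze <- function(data) summary(data)  ## placeholder analysis

## Return the result from the future ...
f <- future({
  analyze(mtcars)
})
y <- value(f)

## ... and write it to file in the parent R session, which is
## guaranteed to have access to the local file system
saveRDS(y, tempfile(fileext = ".rds"))

plan(sequential)  ## shut down the parallel workers
```

This keeps the future itself free of any file-system assumptions, so it works no matter where the worker runs.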
By keeping this future smell test in mind when writing future code, we increase the chances that the code can be parallelized in more ways than just on the local computer. Properly written future code works regardless of which future strategy the end user picks, e.g.
plan(sequential)
plan(multisession)
plan(cluster, workers = rep(c("n1.remote.org", "n2.remote.org", "n3.remote.org"), each = 32))
Remember, as developers we never know what compute resources the end user has access to right now, or what they will have access to in six months. Who knows, your code might even end up running on 2,000 cores located on the Moon twenty years from now.
For reasons like the ones mentioned above, refrain from setting plan() inside a function. It is better to leave it to the end user to decide how they want to parallelize. One reason for this is that we can never know how and in what context our code will run. For example, someone might use futures to parallelize a function call in some other package, and that package in turn calls your package internally. If you set plan(multisession) internally without undoing it afterward, you might override the plan() that is already set, breaking any further parallelization.
If you still think it is necessary to set plan(), make sure to undo it when the function exits, also on errors. This can be done using on.exit(), e.g.
my_fcn <- function(x) {
oplan <- plan(multisession)
on.exit(plan(oplan))
y <- analyze(x)
summarize(y)
}
The need for setting the future strategy within a function often comes from developers wanting to add an argument that allows the end user to specify whether to run the function in parallel or sequentially. This often results in code like:
my_fcn <- function(x, parallel = FALSE) {
if (parallel) {
oplan <- plan(multisession)
on.exit(plan(oplan))
y <- future_lapply(x, FUN = analyze) ## from future.apply package
} else {
y <- lapply(x, FUN = analyze)
}
summarize(y)
}
This way the user can use:
y <- my_fcn(x, parallel = FALSE)
or
y <- my_fcn(x, parallel = TRUE)
depending on their needs. However, if another package developer decides to call your function from their own function, they now have to expose that parallel argument to the users of their function, e.g.
their_fcn <- function(x, parallel = FALSE) {
x2 <- preprocess(x)
y <- my_fcn(x2, parallel = parallel)
z <- another_fcn(y)
z
}
Exposing and passing a “parallel” argument along can become quite cumbersome. Instead, it is neater to use:
my_fcn <- function(x) {
y <- future_lapply(x, FUN = analyze) ## from future.apply package
summarize(y)
}
and let the user control whether or not they want to parallelize via plan(), e.g. plan(multisession) and plan(sequential).
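With this plan()-based design, the end user switches between sequential and parallel execution without any changes to my_fcn() itself. A self-contained sketch, with placeholder analyze() and summarization code:

```r
library(future)
library(future.apply)

analyze <- function(x) x^2  ## placeholder for the real analysis

my_fcn <- function(x) {
  y <- future_lapply(x, FUN = analyze)
  Reduce(`+`, y)  ## placeholder summarization: sum the results
}

plan(sequential)                  ## run sequentially ...
y1 <- my_fcn(1:3)

plan(multisession, workers = 2)   ## ... or in parallel
y2 <- my_fcn(1:3)

stopifnot(identical(y1, y2))      ## same result either way

plan(sequential)  ## shut down the parallel workers
```

Note how my_fcn() contains no parallelization arguments at all; the strategy is entirely in the hands of the caller.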
If your example sets the future strategy at the beginning, make sure to reset it with plan(sequential) at the end of the example. The reason for this is that when switching plans, the previous one is cleaned up. This is particularly important for multisession and cluster futures, where plan(sequential) shuts down the underlying PSOCK clusters.
For instance, here is an example:
## Run the analysis in parallel on the local computer
future::plan("multisession")
y <- analyze("path/to/file.csv")
## Shut down parallel workers
future::plan("sequential")
If you forget to shut down the PSOCK cluster, then R CMD check --as-cran, or R CMD check with the environment variable _R_CHECK_CONNECTIONS_LEFT_OPEN_=true set, will produce an error:
$ R CMD check --as-cran mypkg_1.0.tar.gz
...
* checking examples ... ERROR
Running examples in 'analyze-Ex.R' failed
...
> cleanEx()
Error: connections left open:
<-localhost:37400 (sockconn)
<-localhost:37400 (sockconn)
Execution halted
If you, for some reason, prefer not to display the resetting of the future strategy in the help documentation, but you still want it to run, wrap the statement in an Rd \dontshow{} statement.
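For instance, assuming analyze() is the function being documented, the Rd example section could read:

```
\examples{
## Run the analysis in parallel on the local computer
future::plan("multisession")
y <- analyze("path/to/file.csv")
\dontshow{future::plan("sequential")}
}
```

The plan("sequential") call still runs during R CMD check, shutting down the workers, but it is hidden from the rendered help page.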
If you want to make sure your code works when running sequentially as well as when running in parallel, it is often good enough to have package tests that run the code with:
plan(multisession)
If the code works with this setup, you can be sure that all global variables are properly identified and exported to the workers and that the required packages are loaded on the workers.
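For instance, a package test along these lines (using stopifnot() and a placeholder analyze(); the same pattern works inside testthat::test_that()) exercises exactly that: analyze() is a global that must be identified and exported to the background workers for the test to pass.

```r
library(future)
library(future.apply)

analyze <- function(x) x^2  ## placeholder for the package's own function

plan(multisession, workers = 2)

## 'analyze' must be exported to the workers for this to succeed
y <- future_lapply(1:3, FUN = analyze)
stopifnot(identical(y, list(1, 4, 9)))

plan(sequential)  ## shut down the parallel workers
```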
If not all of your tests are written this way, you can set the environment variable R_FUTURE_PLAN=multisession before calling R CMD check. This makes the default future strategy 'multisession' instead of 'sequential'. For example,
$ export R_FUTURE_PLAN=multisession
$ R CMD check --as-cran mypkg_1.0.tar.gz