An introduction to loadr

Nathan Sheffield

2021-04-16

An introduction to loadr

Motivation

In a large, complex R analysis, I often find myself with dozens of resource data objects that I use as a reference for my computation. These are usually some kind of large dataset that I read into R and then use as reference data. As an example, I have a file with a big matrix with values from chromatin accessibility for 120 different cell-types across about a million different genomic regions (so, a 120-by-1 million matrix). I read this into my R sessions occasionally when I want to see, for example, how a new set of regions looks across that data resource. I’m not really changing the resource data, I’m just using it as a reference. I may load dozens of these kinds of resources into a single R session, and I want to re-use them across multiple scripts and multiple sessions, so I find myself reading them in a lot. I also start writing functions around these kinds of resources.

Given this complexity, sometimes it gets confusing for me to keep track of all the variables I keep in my primary R workspace, so I started to investigate how I could use R environments to declutter things. I discovered that I could load these kinds of “shared data reference” sets into a separate R environment, let’s call it SV for “Shared Variables” – and then they don’t clutter my primary R workspace and I can still access them with SV$variable. This cleaned things up a bit for me and so I’ve developed a series of functions that make it really easy to interact with reference data like this kept in its own separate environment. That’s what loadr is.

Before we start loading data, it’s important to tell loadr the name of the environment where I want to store these reference objects. loadr uses a global variable (LOADRENV) to keep track of this environment name and provides a getter/setter function (loadrEnv()) to change or retrieve this. By default, we’ll use an environment named SV. To make it explicit, just do this:

library(loadr)
loadrEnv("SV")

(This is optional, SV is the default environment name. You can also use this to name your shared variable environment something else). Now, let’s generate some random data and load it into that environment with loadr functions:

eload(list(sampleData = rnorm(1e7, 0,1)))
## Newly Loaded: sampleData

Notice that the eload function takes a named list. The name of item in the list becomes the variable name of the object. If you now look at the R objects in your workspace, you won’t see an object named sampleData, but you will see one named SV, which is a object of type environment:

ls()
## [1] "SV"
class(SV)
## [1] "environment"

The sampleData object has been put in the SV environment:

ls(envir=SV)
## [1] "sampleData"

We can get that sampleData object back like this:

head(SV$sampleData)
## [1]  0.034760674 -0.555494027  0.458113803 -0.001767163  0.553871776
## [6]  0.312388080

If you have multiple objects to load at the same time, no problem:

eload(list(x=5, y=7))
## Newly Loaded: x, y
## Unchanged: sampleData

The function has notified us that the x and y objects are newly loaded in to the shared variable environment. What happens if we try to load objects with the same name? They will get updated:

eload(list(x=526, y=234))
## Updated: x, y
## Unchanged: sampleData
SV$x
## [1] 526
SV$y
## [1] 234

In practice, I write functions that load up data and then I want to load up the results of these functions. Let’s define a function that will read data from a (potentially large) file:

loadMyData = function() {
    filePath = system.file("extdata", "mydata.txt", package="loadr")
    myData = read.table(filePath)
    return(myData)
}
eload(list(myvector=loadMyData()))
## Newly Loaded: myvector
## Unchanged: sampleData, x, y
SV$myvector
##     V1
## 1    2
## 2    3
## 3    5
## 4    2
## 5    3
## 6    3
## 7    4
## 8   62
## 9    2
## 10   3
## 11   4
## 12   6
## 13   3
## 14   4
## 15 564

Loading local variables or variably named variables with vload

What if I just want to load a local R variable, myLocalVar, under its current name? eload(myLocalVar) doesn’t work because there is no named list of variables, so that will fail. It’s possibe but a bit cumbersome to use eload(list(myLocalVar=myLocalVar)).

Here’s where an alias called vload() can help:

myLocalVar = 15
vload(myLocalVar)
## Newly Loaded: myLocalVar
## Unchanged: myvector, sampleData, x, y

vload can accept any number of variables:

myLocalVar2 = 22
myLocalVar3 = 31
vload(myLocalVar2, myLocalVar3)
## Newly Loaded: myLocalVar2, myLocalVar3
## Unchanged: myLocalVar, myvector, sampleData, x, y

You can also use vload with a special named argument varNames to assign the objects to variably named variables. For example:

myLocalVar4 = 15
vload(myLocalVar2, myLocalVar3, varNames=c("var1", "var2"))
## Newly Loaded: var1, var2
## Unchanged: myLocalVar, myLocalVar2, myLocalVar3, myvector, sampleData, x, y

To see how this could be useful, let’s imagine our data loading function depends on some variable which is script dependent, like the reference genome assembly:

loadMyData = function(genome) {
    filePath = system.file("extdata", "mydata.txt", package="loadr")
    myData = read.table(filePath)
    return(myData)
}

Let’s load up some data from this function for a few different genomes at the same time:

genome="hg19"
vload(loadMyData(genome), varNames=paste0("myvector_", genome))
## Newly Loaded: myvector_hg19
## Unchanged: myLocalVar, myLocalVar2, myLocalVar3, myvector, sampleData, var1, var2, x, y
genome="mm10"
vload(loadMyData(genome), varNames=paste0("myvector_", genome))
## Newly Loaded: myvector_mm10
## Unchanged: myLocalVar, myLocalVar2, myLocalVar3, myvector, myvector_hg19, sampleData, var1, var2, x, y

An easier way to accomplish the same thing that will be to just lapply across our list of genomes:

genomes=c("hg19", "mm10")
res = lapply(genomes, function(genome) { 
    vload(loadMyData(genome), varNames=paste0("myvector_", genome))
    })
## Updated: myvector_hg19
## Unchanged: myLocalVar, myLocalVar2, myLocalVar3, myvector, myvector_mm10, sampleData, var1, var2, x, y
## Updated: myvector_mm10
## Unchanged: myLocalVar, myLocalVar2, myLocalVar3, myvector, myvector_hg19, sampleData, var1, var2, x, y