library(fRLR)
This R package fits Repeated Linear Regressions: sets of linear regressions that share some common terms.
Let's start with the simplest situation: we want to fit a set of regressions which differ in only one variable. Specifically, denote the response variable by \(y\); the regressions are as follows.
\[ \begin{array}{ll} y&\sim x_1 + cov_1 + cov_2+\ldots+cov_m\\ y&\sim x_2 + cov_1 +cov_2+\ldots+cov_m\\ &\;\;\vdots\\ y&\sim x_n + cov_1 +cov_2+\ldots+cov_m \end{array} \]
where \(cov_i,\ i=1,\ldots, m\) are the covariates shared by all of these regressions.
Intuitively, we can finish this task by using a simple loop.
model = vector(mode='list', length=n)
for (i in 1:n)
{
  ...
  model[[i]] = lm(y~x)
  ...
}
However, a plain loop is inefficient in this situation. Recall that in linear regression the main goal is to estimate the parameter \(\beta\), and we have
\[ \hat\beta = (X'X)^{-1}X'Y \]
where \(X\) is the design matrix and \(Y\) is the observation of response variable.
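As a quick illustration of this closed form, it can be computed directly in R (a minimal sketch; beta_hat is just an illustrative helper, not a function from the package):

beta_hat = function(X, Y) {
  # least-squares estimate: solve (X'X) beta = X'Y
  solve(t(X) %*% X, t(X) %*% Y)
}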
Obviously, the design matrices of these regressions share many elements, and the larger \(m\) is, the more elements they share. So I want to reduce the computation cost by separating out the common part of the design matrix.
For the above example, the design matrix can be written as \(X=(x, cov)\). If we include an intercept, it can also be treated as a variable shared by all of the regressions, so it is naturally absorbed into \(cov\). Then we have
\[ X'X= \left[ \begin{array}{cc} x'x & x'cov \\ cov'x & cov'cov \end{array} \right]= \left[ \begin{array}{cc} a& v'\\ v& B \end{array} \right] \]
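To make the notation concrete, here is a small sketch with made-up data; x plays the role of the varying variable and cov the shared covariates, mirroring the notation above:

set.seed(1)
x = rnorm(10)
cov = cbind(1, matrix(rnorm(30), 10, 3))  # intercept absorbed into cov
a = sum(x * x)                            # x'x (scalar)
v = t(cov) %*% x                          # cov'x
B = t(cov) %*% cov                        # cov'cov
XtX = crossprod(cbind(x, cov))            # X'X for X = (x, cov)
max(abs(XtX - rbind(c(a, t(v)), cbind(v, B))))  # ~0: the partition holds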
The Woodbury formula tells us
\[ (A+UCV)^{-1}=A^{-1}-A^{-1}U(C^{-1}+VA^{-1}U)^{-1}VA^{-1} \]
Let
\[ A=\left[ \begin{array}{ll} a&O\\ O&B \end{array}\right],\; U=\left[ \begin{array}{ll} 1 & 0\\ O & v \end{array} \right],\; V= \left[ \begin{array}{ll} 0& v'\\ 1& O \end{array} \right] \]
and \(C=I_{2\times 2}\). Then we can apply the Woodbury formula,
\[ (X'X)^{-1}=(A+UCV)^{-1}=A^{-1}-A^{-1}U(C^{-1}+VA^{-1}U)^{-1}VA^{-1} \]
where
\[ A^{-1}=\left[ \begin{array}{cc} a^{-1}&O\\ O&B^{-1} \end{array} \right] \]
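Continuing the earlier sketch, this step can be checked numerically (again purely illustrative; \(C=I\), so \(UCV=UV\)):

m = ncol(cov)
A = rbind(c(a, rep(0, m)), cbind(rep(0, m), B))
U = rbind(c(1, 0), cbind(0, v))
V = rbind(c(0, t(v)), c(1, rep(0, m)))
max(abs(A + U %*% V - XtX))               # ~0: A + UCV recovers X'X
Ainv = rbind(c(1/a, rep(0, m)), cbind(rep(0, m), solve(B)))
W = Ainv - Ainv %*% U %*% solve(diag(2) + V %*% Ainv %*% U) %*% V %*% Ainv
max(abs(W - solve(XtX)))                  # ~0: Woodbury reproduces (X'X)^{-1}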
Further block calculations simplify this to
\[ (X'X)^{-1}=\left[ \begin{array}{cc} \frac{1}{a-v'B^{-1}v} & -\frac{v'B^{-1}}{a-v'B^{-1}v}\\ -\frac{B^{-1}v}{a-v'B^{-1}v} & B^{-1}+\frac{B^{-1}vv'B^{-1}}{a-v'B^{-1}v} \end{array} \right] \]
Notice that the matrix \(B\) is identical across all of the regressions; the only quantities that change from regression to regression are \(a\) and \(v\), which are very cheap to calculate. So, theoretically, we can reduce the computation cost significantly.
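Continuing the sketch once more, we can verify this closed form and see the saving: \(B^{-1}\) is computed once, while only \(a\) and \(v\) would be recomputed for each regression:

Binv = solve(B)
s = drop(a - t(v) %*% Binv %*% v)         # Schur complement a - v'B^{-1}v
inv = rbind(cbind(1/s, -t(v) %*% Binv / s),
            cbind(-Binv %*% v / s, Binv + Binv %*% v %*% t(v) %*% Binv / s))
max(abs(inv - solve(XtX)))                # ~0: matches the direct inverse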
Now let's test the functions in this package on two simulated examples.
## use fRLR package
set.seed(123)
X = matrix(rnorm(50), 10, 5)
Y = rnorm(10)
COV = matrix(rnorm(40), 10, 4)
frlr1(X, Y, COV)
#> r r.p.value
#> 1 0 0.4380128
#> 2 1 0.7791076
#> 3 2 0.2212869
#> 4 3 0.9495018
#> 5 4 0.6729983
## use simple loop
res = matrix(nrow = 0, ncol = 2)
for (i in 1:ncol(X))
{
  mat = cbind(X[,i], COV)
  df = as.data.frame(mat)
  model = lm(Y~., data = df)
  tmp = c(i, summary(model)$coefficients[2, 4])
  res = rbind(res, tmp)
}
res
#>     [,1]      [,2]
#> tmp 1 0.4380128
#> tmp 2 0.7791076
#> tmp 3 0.2212869
#> tmp 4 0.9495018
#> tmp 5 0.6729983
As we can see in the above output, the p-values for the variable of interest in each regression agree between the two methods. (Note that frlr1 indexes the variables from 0, while the loop indexes them from 1.)
Similarly, we can test another example:
set.seed(123)
X = matrix(rnorm(50), 10, 5)
Y = rnorm(10)
COV = matrix(rnorm(40), 10, 4)

idx1 = c(1, 2, 3, 4, 1, 1, 1, 2, 2, 3)
idx2 = c(2, 3, 4, 5, 3, 4, 5, 4, 5, 5)

frlr2(X, idx1, idx2, Y, COV)
#> r1 r2 r1.p.value r2.p.value
#> 1 1 2 0.53021406 0.895719578
#> 2 2 3 0.01812006 0.009833047
#> 3 3 4 0.29895922 0.963995969
#> 4 4 5 0.91749181 0.712075464
#> 5 1 3 0.33761507 0.210331456
#> 6 1 4 0.51074586 0.966484642
#> 7 1 5 0.12479380 0.152802911
#> 8 2 4 0.79302893 0.902402294
#> 9 2 5 0.73153760 0.663392258
#> 10 3 5 0.32367303 0.877154122
#> 11 2 3 0.01812006 0.009833047
#> 12 1 2 0.53021406 0.895719578
#> 13 3 4 0.29895922 0.963995969
#> 14 4 5 0.91749181 0.712075464
#> 15 1 3 0.33761507 0.210331456
#> 16 1 4 0.51074586 0.966484642
#> 17 1 5 0.12479380 0.152802911
#> 18 2 4 0.79302893 0.902402294
#> 19 2 5 0.73153760 0.663392258
#> 20 3 5 0.32367303 0.877154122
res = matrix(nrow=0, ncol=4)
for (i in 1:length(idx1))
{
  mat = cbind(X[, idx1[i]], X[, idx2[i]], COV)
  df = as.data.frame(mat)
  model = lm(Y~., data = df)
  tmp = c(idx1[i], idx2[i], summary(model)$coefficients[2,4], summary(model)$coefficients[3,4])
  res = rbind(res, tmp)
}
Again, we obtain the same results from the two methods.
The main aim of this method is to reduce computation cost, so now let's compare its speed with the simple-loop approach.
Here is the time cost for \(99\times 100/2=4950\) linear regressions.
set.seed(123)
n = 100
X = matrix(rnorm(10*n), 10, n)
Y = rnorm(10)
COV = matrix(rnorm(40), 10, 4)

#idx1 = c(1, 2, 3, 4, 1, 1, 1, 2, 2, 3)
#idx2 = c(2, 3, 4, 5, 3, 4, 5, 4, 5, 5)
id = combn(n, 2)
idx1 = id[1, ]
idx2 = id[2, ]
system.time(frlr2(X, idx1, idx2, Y, COV))
#> user system elapsed
#> 2.048 0.109 0.965
simpleLoop <- function()
{
  res = matrix(nrow=0, ncol=4)
  for (i in 1:length(idx1))
  {
    mat = cbind(X[, idx1[i]], X[, idx2[i]], COV)
    df = as.data.frame(mat)
    model = lm(Y~., data = df)
    tmp = c(idx1[i], idx2[i], summary(model)$coefficients[2,4], summary(model)$coefficients[3,4])
    res = rbind(res, tmp)
  }
}
system.time(simpleLoop())
#> user system elapsed
#> 9.275 0.011 9.413
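The timings above come from a single run. For a more stable comparison, one could repeat the measurement, for example with the microbenchmark package (a sketch, assuming the package is installed; results will vary by machine):

library(microbenchmark)
# repeat each expression 10 times and summarize the timing distribution
microbenchmark(fRLR = frlr2(X, idx1, idx2, Y, COV),
               loop = simpleLoop(),
               times = 10)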