The goal of outerbase is to make the production of near-interpolators easy, stable, and scalable. It is based on a C++ backend and interfaced with R via {Rcpp}. Under the hood, it leverages unique, custom linear algebra using {RcppArmadillo} (Armadillo) and OpenMP. The overall structure is designed to be interacted with in an object-oriented manner using Rcpp modules. There are also ways to interact with outerbase for those who are uncomfortable with (or just actively detest) object-oriented programming.
To begin, load the package.

library(outerbase)

Note that if you built this package from source, make sure you use a compiler that can process OpenMP commands to access the full speed benefits.
To understand how to get started with using the package, we predict using data from a function with an eight-dimensional input commonly known as the Borehole function. We begin by generating 400 points using a test function ?obtest_borehole8d built into the package.

sampsize = 400
d = 8
x = matrix(runif(sampsize*d), ncol=d)  # uniform samples
y = obtest_borehole8d(x) + 0.5*rnorm(sampsize)
Our goal will be to design a predictor for y given x that is a near-interpolator. The simplest way to interact with this package is using ?obfit (fitting outerbase). The function requires x, y, and two other objects: numb and covnames. The value of numb is the number of basis functions you want to use. The choice of numb is still under research, but generally you want it to be as large as tolerable. Play around!
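To get a feel for the trade-off, one could compare a few values of numb on the training data. A sketch using the x and y generated above; the specific values of numb are illustrative, not recommendations:

```r
# Illustrative sweep: larger numb usually fits more closely,
# but costs more time and memory to train.
for (numb in c(100, 200, 300)) {
  fit = obfit(x, y, numb=numb, covnames=rep("mat25pow", 8))
  pred = obpred(fit, x)
  cat("numb =", numb,
      "training RMSE =", round(sqrt(mean((y - pred$mean)^2)), 3), "\n")
}
```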
The underlying concepts of this approach come from Gaussian processes. Thus the core building blocks of predictors will be covariance functions. The choice of covariances needed for obfit is a list of strings corresponding to each column in x. Type listcov() to discover what covariance functions are currently deployed.

listcov()
#> [1] "mat25pow" "mat25" "mat25ang"

If you are curious about any of them, type, e.g., ?covf_mat25pow.
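Because covnames is specified per column, the covariance family may differ across dimensions. A sketch (this particular mixture is illustrative, not a recommendation):

```r
# One covariance name per input column: mat25 for the first
# dimension, mat25pow for the remaining seven.
covnames = c("mat25", rep("mat25pow", 7))
obmodel = obfit(x, y, numb=300, covnames=covnames)
```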
Note obfit has some checks in place to prevent serious damage. They are not foolproof.

obmodel = obfit(x, y, covnames=rep("elephant",8))
#> Error in .checkcov(covnames[k], x[, k]):
#> covariances must be from listcov()

obmodel = obfit(x, y[1:200], covnames=rep("mat25pow",5))
#> Error in obfit(x, y[1:200], covnames = rep("mat25pow", 5)):
#> x and y dims do not align

obmodel = obfit(x[1:2,], y[1:2], covnames=rep("mat25pow",8))
#> Error in obfit(x[1:2, ], y[1:2], covnames = rep("mat25pow", 8)):
#> dimension larger than sample size has not been tested

obmodel = obfit(x, y, numb = 2, covnames=rep("mat25pow",8))
#> Error in obfit(x, y, numb = 2, covnames = rep("mat25pow", 8)):
#> number of basis functions should be less than twice the dimension

obmodel = obfit(100*x, y, covnames=rep("mat25pow",8))
#> Error in .checkcov(covnames[k], x[, k]):
#> x ranges exceed limits of covariance functions
#> the limits are between 0 and 1
#> try rescaling

obmodel = obfit(0.001*x, y, covnames=rep("mat25pow",8))
#> Error in .checkcov(covnames[k], x[, k]):
#> x are too small for ranges
#> the limits are between 0 and 1
#> try rescaling
Below is one correct deployment, where mat25pow is used for all dimensions. This should take a bit to run, but it should be around a second on most modern computers.

ptm = proc.time()
obmodel = obfit(x, y, numb=300, covnames=rep("mat25pow",8),
                verbose = 3)
#> doing partial optimization
#> max number of cg steps set to 100
#>
#> ########started BFGS#######
#> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#> 0 220.745 NA NA 0.1
#> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#> 1 -296.9 -517.588 -133.116 0.4
#> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#> 2 -804.38 -507.41 -5558.19 0.109596
#> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#> 3 -1106.1 -301.688 -212.856 0.546858
#> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#> 4 -1177 -70.8663 -1377.63 0.29044
#> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#> 5 -1250.25 -73.245 -141.959 0.328663
#> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#> 6 -1297.93 -47.6694 -68.8574 0.367347
#> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#> 7 -1320.23 -22.3002 -24.9251 0.40604
#> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#> 8 -1336.55 -16.3209 -11.4459 0.444336
#> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#> 9 -1353.65 -17.0936 -12.776 0.481882
#> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#> 10 -1367.47 -13.8251 -14.6767 0.518378
#> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#> 11 -1374.04 -6.5679 -10.1321 0.553582
#> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#> 12 -1376.13 -2.08782 -13.956 0.587305
#> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#> 13 -1378.24 -2.10842 -3.1903 0.619409
#> num iter: 14 obj start: 220.7445 obj end: -1378.609
#> final learning rate: 0.6497998
#> approx lower bound (not achieved): -1379.679
#> #########finished BFGS########
#>
#> doing optimization 1
#> max number of cg steps set to 315
#>
#> ########started BFGS#######
#> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#> 0 -1379.31 NA NA 0.3249
#> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#> 1 -1381.85 -2.53389 -1.74958 0.3249
#> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#> 2 -1384.84 -2.99056 -3.72161 0.363559
#> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#> 3 -1386.12 -1.28781 -2.39216 0.40227
#> num iter: 4 obj start: -1379.311 obj end: -1386.528
#> final learning rate: 0.4406213
#> approx lower bound (not achieved): -1387.874
#> #########finished BFGS########
#>
#> doing optimization 2
#> max number of cg steps set to 316
#>
#> ########started BFGS#######
#> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#> 0 -1386.34 NA NA 0.220311
#> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#> 1 -1386.37 -0.0320092 -0.0255303 0.220311
#> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#> 2 -1386.39 -0.0267811 -0.0183269 0.25629
#> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#> 3 -1386.42 -0.0240469 -0.0264348 0.293669
#> num iter: 4 obj start: -1386.336 obj end: -1386.43
#> final learning rate: 0.3319501
#> approx lower bound (not achieved): -1386.475
#> #########finished BFGS########
print((proc.time() - ptm)[3])
#> elapsed
#> 0.59
Note that the package is built using custom parallelization at the linear-algebra level. The package relies on OpenMP for parallelization, so if the package was not compiled with OpenMP in place there will be no speed benefits. The default call of obfit grabs all available threads, which is ideal for desktops/laptops. It might be less ideal for large clusters where the CPU might be shared. We can adjust the number of threads manually. Below we restrict ourselves to a single thread, which should slow things down.
ptm = proc.time()
obmodel = obfit(x, y, numb=300, covnames=rep("mat25pow",8),
                nthreads=1)  # optional input
print((proc.time() - ptm)[3])
#> elapsed
#> 1.21
We can then predict using ?obpred. While it is not an exact interpolator, it is close.

predtr = obpred(obmodel, x)
rmsetr = sqrt(mean((y-predtr$mean)^2))
plot(predtr$mean, y,
     main=paste("training \n RMSE = ", round(rmsetr,3)),
     xlab="prediction", ylab="actual")
Since we generated this data, we can show that outerbase can reasonably predict the ground truth, meaning overfitting is not an issue.

ytrue = obtest_borehole8d(x)
rmsetr = sqrt(mean((ytrue-predtr$mean)^2))
plot(predtr$mean, ytrue,
     main=paste("oracle \n RMSE = ", round(rmsetr,3)),
     xlab="prediction", ylab="actual")
1000 test points generated the same way as our original data can also serve as a verification process.

xtest = matrix(runif(1000*d), ncol=d)  # prediction points
ytest = obtest_borehole8d(xtest) + 0.5*rnorm(1000)
The predictions at these new points are also quite good. Not quite as good as the residuals on the training set, but these are points the model never saw.

predtest = obpred(obmodel, xtest)
rmsetst = sqrt(mean((ytest-predtest$mean)^2))
plot(predtest$mean, ytest,
     main=paste("testing \n RMSE = ", round(rmsetst,3)),
     xlab="prediction", ylab="actual")
This package also produces variances on the predictions, which we can use to check reasonableness. The fact that the second histogram looks like a standard normal suggests that the predictions are reasonable.

hist((ytest-predtest$mean),
     main="testing \n residuals", xlab="residuals")
hist((ytest-predtest$mean)/sqrt(predtest$var),
     main="testing \n standardized residuals",
     xlab="standardized residuals")
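If the standardized residuals are truly standard normal, about 95% of them should land within ±1.96. A quick sketch of that check, using ytest and predtest from above:

```r
# Empirical coverage of a nominal 95% prediction interval.
z = (ytest - predtest$mean)/sqrt(predtest$var)
mean(abs(z) < 1.96)  # a value near 0.95 suggests calibrated variances
```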