I discuss how to use the oops package in R to create an artificial neural network by hand, then train the network to forecast price inflation. This is for illustrative purposes and is not intended as an optimal method of either building a neural network or forecasting inflation.
Most discussions of artificial neural networks start with a graph of interconnected circles to demonstrate how the network connects data inputs to output through different “hidden layers”. Those graphs are overly complex and obfuscate the network’s internal structure. I think the best way to start is to think about a regression. A regression takes a set of independent variables and relates them to a dependent variable through a specified function, such as relating height, exercise, and eating habits to weight. Mathematically, this is
\[\widehat{y} = f(\beta\cdot X)\]
where \(X\) is a matrix of independent variables, \(\beta\) is a vector of regression coefficients that describe the importance of each variable, \(f(\bullet)\) is the function, and \(\widehat{y}\) is the prediction of the dependent variable. The simple linear regression that plots the best-fitting straight line through a series of data points just uses the linear function \(f(x) = x\). A regression typically solves for \(\beta\) by minimizing some loss function, commonly the sum of squared prediction errors across observations - \(\sum_i \left(y_i - \widehat{y}_i\right)^2\).
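To make this concrete, here is a minimal sketch in base R (the simulated data and variable names are made up for illustration) of fitting the linear case \(f(x) = x\) and computing the loss being minimized:

set.seed(1)
X    <- cbind(1, height = rnorm(100), exercise = rnorm(100))  # independent variables
beta <- c(2, 0.5, -0.3)                                       # "true" coefficients
y    <- X %*% beta + rnorm(100, sd = 0.1)                     # dependent variable

fit   <- lm(y ~ X - 1)        # solves for beta by least squares
y_hat <- X %*% coef(fit)      # the regression prediction
sum((y - y_hat)^2)            # sum of squared prediction errors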
Neural networks expand the idea of regression by nesting multiple functions and sets of coefficient weights.
\[\widehat{y} = f_n \left(\dots f_2\left(\beta_2\cdot f_1\left(\beta_1\cdot X\right)\right)\right)\]
Each of the \(n\) functions above and their weights are the network’s layers, \(n-1\) of which are often considered “hidden”. The nested functions add nonlinearity and allow the network to capture relationships that are more complex than a standard regression can. Note that this simplifies to a single, linear function if all \(f\) are linear.
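To see why, two stacked linear layers multiply out to a single weight matrix. A quick check with arbitrary matrices (a toy sketch, separate from the network built below):

X  <- matrix(rnorm(12), nrow = 4)   # 4 observations, 3 inputs
B1 <- matrix(rnorm(6),  nrow = 3)   # layer 1 weights: 3 -> 2
B2 <- matrix(rnorm(2),  nrow = 2)   # layer 2 weights: 2 -> 1
all.equal(X %*% B1 %*% B2, X %*% (B1 %*% B2))  # TRUE: one linear layer suffices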
The different \(\beta\)’s are estimated by starting with an initial guess, then feeding the input matrix \(X\) forward through each function to end up with a prediction. The prediction for each observation is compared to its original value to determine the error. The error is then used with the gradient of each function, starting with the last function, to determine how to update each \(\beta\). This is called backpropagation, and the process repeats until the error is sufficiently small or the maximum number of iterations is reached.
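The full algorithm is implemented below, but the core update is easier to see with a single weight. The following stand-alone sketch (not the oops implementation) repeats the feed-forward, error, and gradient steps for \(\widehat{y} = \beta x\) with a squared-error loss:

set.seed(2)
x    <- rnorm(50)
y    <- 1.5 * x + rnorm(50, sd = 0.1)
beta <- 0.01                               # initial guess
lr   <- 0.1                                # learning rate

for (epoch in 1:100){
  y_hat <- beta * x                        # feed forward
  grad  <- -2 * mean((y - y_hat) * x)      # gradient of the mean squared error
  beta  <- beta - lr * grad                # update the weight
}
beta                                       # approaches the true value of 1.5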
To implement a neural network, we will use the oops library in R to create a Layer Class. oClass() creates a function that generates Layer objects. The argument "Layer" gives each instance the S3 class name “Layer” to use with the different methods.
library(oops)
<- oClass("Layer") Layer
Each layer will have a set of weights, or \(\beta\) in the previous examples. The dimensions of beta will vary for each instance of the layer, so they will be provided when the layer is created. Once the dimensions are known, beta can be initialized to a set of random numbers, though we will initialize it to a specific set of values in this example to ensure that the vignette is reproducible. We will also keep track of the output in each layer for backpropagation.
We will create an init method for “Layer” objects that takes the dimensions and populates the object fields accordingly. init() is automatically called on each new oops Class instance.
init.Layer <- function(self, dim){
  self$dim <- dim
  self$beta <- matrix(rep(0.01, prod(dim)), nrow = dim[1])
  self$output <- 0
  return(self)
}
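As a quick check, creating an instance should call init.Layer automatically and populate the fields (the dimensions below are arbitrary):

l <- Layer(dim = c(3, 2))   # init.Layer runs on creation
dim(l$beta)                 # a 3 x 2 matrix of starting weights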
Each layer is also associated with a particular activation function and its derivative. We could include these in the init function above, but another solution is to create a different Class associated with each function. As long as the new class inherits from our Layer Class, the init() function will continue to work.
Most neural networks use the same activation function in all layers. However, this exercise is primarily for demonstration purposes, so we will use three different functions.
Layer_Relu <- oClass(
  inherit = Layer,
  active_fun = function(x) ifelse(x <= 0, 0, x),
  deriv_fun = function(x) ifelse(x <= 0, 0, 1)
)
#> Warning in create_oclass(dots, name = name, inherit = inherit, portable
#> = portable, : Classes should be named to ensure that methods are properly
#> deployed.

Layer_Gcu <- oClass(
  inherit = Layer,
  active_fun = function(x) x*cos(x),
  deriv_fun = function(x) cos(x) - x*sin(x)
)
#> Warning in create_oclass(dots, name = name, inherit = inherit, portable
#> = portable, : Classes should be named to ensure that methods are properly
#> deployed.

Layer_Softplus <- oClass(
  inherit = Layer,
  active_fun = function(x) log(1 + exp(x)),
  deriv_fun = function(x) 1/(1 + exp(-x))
)
#> Warning in create_oclass(dots, name = name, inherit = inherit, portable
#> = portable, : Classes should be named to ensure that methods are properly
#> deployed.
Note the difference between each of the Class objects above and the original Layer class. First, I did not provide a name for each class, which issued a warning. This is ok since I do not need to distinguish between the individual layer types - this can be done by examining the activation function if necessary. Also, each new class inherits from Layer, meaning that “Layer” methods will already deploy on each new instance.
class( Layer_Relu(1) )
#> [1] "Layer" "Instance"
ls( Layer_Relu(1) )
#> [1] "beta" "dim" "output"
The other difference is that the values for each function are provided at the time of Class creation. Any named value passed to oClass() is stored as a field that all instances inherit. For example, all instances created by Layer_Relu can automatically access the active_fun field, though they do not directly store it. This is why active_fun did not show up when ls() was used above.
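For instance (assuming the classes defined above), the inherited activation function can still be reached from an instance even though the instance does not store it:

l <- Layer_Relu(dim = c(2, 2))
l$active_fun(c(-1, 0, 2))   # found on the class, not the instance: returns 0 0 2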
Now that we have our layers, let us build a Class for our neural network.
<- oClass("NNetwork") NNetwork
Each network needs to initialize with a dependent \(y\) variable and a matrix of independent \(X\) variables. It also needs to store the layers - we will use one of each of the three we created. Finally, we will train the network by fitting our beta weights to the data.
init.NNetwork <- function(self, y, X){
  self$y <- y
  self$X <- X
  self$layers <- list(
    Layer_Gcu( dim = c(ncol(X), 4) ),
    Layer_Relu( dim = c(4, 2) ),
    Layer_Softplus( dim = c(2, 1) )
  )
  train_nn(self)
}
Fitting a neural network involves two steps: feeding the data through each of the layers, then propagating the error backwards through each layer to adjust the weights. This is repeated until the change in the error is sufficiently small or a maximum number of iterations, or epochs, has been reached. So, let us add some starting values to our neural network Class to store these values, along with some additional control variables.
NNetwork$mse <- Inf
NNetwork$epoch <- 0
NNetwork$tolerance <- 1e-08
NNetwork$max_epoch <- 5000
NNetwork$learn_rate <- 0.2
One of the included control variables is the learning rate, which determines how much we change our beta weights in each epoch. If this value is too high, our betas may oscillate around their optimal values without ever reaching them. If it is too low, the number of epochs needed will grow very large (or hit the maximum) and we may never reach the best set of values.
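The trade-off is easy to see on a toy problem. Here is a small, self-contained sketch (separate from the network code) that minimizes \((b - 3)^2\) by gradient descent with different learning rates:

descend <- function(lr, b = 0, steps = 25){
  for (i in 1:steps) b <- b - lr * 2 * (b - 3)   # gradient of (b - 3)^2 is 2(b - 3)
  b
}
descend(lr = 0.3)    # converges quickly to 3
descend(lr = 0.01)   # too low: still far from 3 after 25 steps
descend(lr = 1.05)   # too high: overshoots and oscillates away from 3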
Next, let us create our train_nn function.
train_nn <- function(nn){
  while(TRUE) {
    nn$epoch <- nn$epoch + 1
    if (nn$epoch >= nn$max_epoch) break
    out <- feed_forward(nn)          # returns output of last layer
    diff <- back_propagate(nn, out)  # returns change in mse
    if (diff < nn$tolerance){
      nn$flag <- 1
      break
    }
  }
  nn
}
To feed forward, we start with our input X and apply each layer’s activation function sequentially, keeping track of the output at each step for later adjustment. We will return the final output so that we can easily pass it into the next backpropagation step.
feed_forward <- function(nn){
  output <- nn$X
  for (layer in nn$layers){
    output <- layer$active_fun(output %*% layer$beta)
    layer$output <- output
  }
  output
}
Feeding the data through the layers is the easy part; changing the weights is more difficult. We use a method called gradient descent, which takes the derivative of the activation function in each layer and moves along it depending on the relative size of the error.
For the algorithm, we start at the last layer and calculate the loss gradient, \(\delta\). Let \(\epsilon = \left(y - \widehat{y}\right)\) be the total prediction error and \(f_i^\prime\left(z_i\right)\) be the derivative of the activation function of layer \(i\) applied to its output. Finally, let \(\cdot\) represent element-wise multiplication, while \(\times\) is matrix multiplication.
For the last layer,
\[\delta_L = \epsilon\cdot f_L^\prime\left(z_L\right)\]
The loss gradient for each other layer is calculated recursively by
\[\delta_i = \left[\beta_{i+1}^T \times \delta_{i+1}\right]\cdot f_i^\prime\left(z_i\right)\]
Finally, our change in weights for each layer is determined by multiplying the output from the previous layer by the loss gradient. The initial layer uses the original input matrix.
\[\Delta \beta_i = -z_{i-1}^T \times \delta_i\]
Since we are calculating the error in this step, we might as well calculate the mean squared error and return how it is changing at each iteration.
back_propagate <- function(nn, output){
  error <- output - nn$y
  delta <- 2*error/length(output)   # Loss = sum(error^2)/n => dLoss = 2/n*error
  nlayer <- length(nn$layers)

  for (i in nlayer:1){
    layeri <- nn$layers[[i]]
    if (i == nlayer){
      delta <- layeri$deriv_fun(output) * delta
    } else {
      delta <- (delta %*% t(nn$layers[[i+1]]$beta)) *
        layeri$deriv_fun(layeri$output)
    }
    if (i == 1){
      dweight <- t(nn$X) %*% delta
    } else {
      dweight <- t(nn$layers[[i-1]]$output) %*% delta
    }
    layeri$beta <- layeri$beta - nn$learn_rate * dweight
  }
  mse <- mean(error^2)
  dmse <- abs(nn$mse - mse)
  nn$mse <- mse
  return(dmse)
}
All we have left now is our prediction function. This will look very similar to the feed_forward() function defined above, except that the output at each layer is not saved.
predict_nn <- function(nn, X){
  output <- X
  for (layer in nn$layers){
    output <- layer$active_fun(output %*% layer$beta)
  }
  output
}
We need some data on which to apply our artificial neural network. The data is available using data(cpi_data) in the oops package, but this vignette will demonstrate how to construct the data set as well. We will use the eFRED package to download the consumer price index.
library(eFRED)
library(dplyr)
<- fred("CPIAUCSL") dat
To calculate annual inflation, we take the log difference between the CPI and its 12-month lag. We also calculate several past values of price inflation to use as our independent variables.
cpi_data <- dat %>% transmute(
  date,
  pi = log(CPIAUCSL) - log(lag(CPIAUCSL, 12)),
  pi.1 = lag(pi, 1),
  pi.2 = lag(pi, 2),
  pi.3 = lag(pi, 3),
  pi.6 = lag(pi, 6),
  pi.12 = lag(pi, 12)
) %>%
  na.omit
Now, let us split the dataset into training and test sets that will be used for fitting and evaluating the network, respectively.
train <- cpi_data[which(cpi_data$date <= as.Date("2019-12-01")),]
test  <- cpi_data[which(cpi_data$date > as.Date("2019-12-01") & cpi_data$date < as.Date("2021-01-01")),]
Now we train an artificial neural network with our training data.
y <- as.matrix(train$pi, ncol=1)
X <- cbind(1, as.matrix(train[,3:7]))

nn <- NNetwork(y, X)
Note that nn is an oops Class object holding all of the network information. We can check the number of iterations and the mean squared error accordingly. We can also check the weights for each layer, though these can be difficult to interpret compared to linear regression coefficients.
nn$epoch
#> [1] 4337
nn$mse
#> [1] 0.0001611906
nn$layers[[1]]$beta
#>             [,1]       [,2]       [,3]       [,4]
#>        0.4329761  0.4329761  0.4329761  0.4329761
#> pi.1  -0.4688054 -0.4688054 -0.4688054 -0.4688054
#> pi.2  -0.4537249 -0.4537249 -0.4537249 -0.4537249
#> pi.3  -0.4372365 -0.4372365 -0.4372365 -0.4372365
#> pi.6  -0.3863754 -0.3863754 -0.3863754 -0.3863754
#> pi.12 -0.2635620 -0.2635620 -0.2635620 -0.2635620
To predict the level of price inflation in the next period, we take the lagged inflation columns of our testing data set and feed them into the neural network. Note that this approach assumes that we are only predicting one period ahead since we are using all of the most recent data for each observation.
Xtest <- cbind(1, as.matrix(test[, 3:7]))
prediction <- predict_nn(nn, Xtest)

prediction
#>            [,1]
#> 877  0.02310848
#> 878  0.02360990
#> 879  0.02400029
#> 880  0.02346047
#> 881  0.02183437
#> 882  0.02040416
#> 883  0.02007325
#> 884  0.02045291
#> 885  0.02074300
#> 886  0.02053071
#> 887  0.02066337
#> 888  0.02092708

mean(prediction - test$pi)   # Average error
#> [1] 0.009243088
Note that an ordinary least squares regression actually performs better than the neural network here. Artificial neural networks do not always provide superior forecasts to less sophisticated methods. A lot of tuning of the model parameters and of the number and types of layers is necessary to produce optimal forecasts.
ols <- lm(y ~ -1 + X, data = train)
mean(Xtest %*% as.matrix(ols$coeff) - test$pi)
#> [1] 0.0009104316
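As a further, rough comparison (not part of the original output), one could look at the root mean squared error of each forecast rather than the average error:

rmse <- function(e) sqrt(mean(e^2))
rmse(prediction - test$pi)                       # neural network forecast
rmse(Xtest %*% as.matrix(ols$coeff) - test$pi)   # OLS benchmark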