library(lgpr)
#> Attached lgpr 1.1.5, using rstan 2.21.2. Type ?lgpr to get started.
This vignette describes mathematically the statistical models of lgpr
. We study the different arguments of the lgp()
or create_model()
modeling functions and what parts of the probabilistic model they customize. This is a concise description, and the original publication (Timonen et al. (2021)) has more information about the actual motivation for the used modeling approaches, and the tutorials have code examples.
The models in lgpr
are models for the conditional distribution \[
p(y \mid f(\textbf{x}), \theta_{\text{obs}}),
\] of response variable \(y\) given covariates \(\textbf{x}\), where \(\theta_{\text{obs}}\) is a possible parameter of the observation model (like the magnitude of observation noise). The function \(f\) has a Gaussian Process (GP) prior \[
f \sim \mathcal{GP}(0, k\left(\textbf{x}, \textbf{x}' \mid \theta_{\text{GP}})\right),
\]
with covariance (kernel) function \(k(\textbf{x}, \textbf{x}' \mid \theta_{\text{GP}})\) that has hyperparameters \(\theta_{\text{GP}}\). In addition to the GP prior for \(f\), there is a parameter prior distribution \(p(\theta)\) for \(\theta = \left\{ \theta_{\text{GP}}, \theta_{\text{obs}} \right\}\). Given \(N\) observations \(\mathcal{D} = \{y_n, \textbf{x}_n\}_{n=1}^N\) the probabilistic models in lgpr
have the form \[\begin{align}
p\left(\theta, \textbf{f}\right) &= p\left(\textbf{f} \mid \theta\right) \cdot p(\theta) & \text{(prior)} \\
p(\textbf{y} \mid \textbf{f}, \theta) &= \prod_{n=1}^N p(y_n \mid f(\textbf{x}_n), \theta_{\text{obs}}) & \text{(likelihood)},
\end{align}\] where \(\textbf{f} = \left[ f(\textbf{x}_1), \ldots, f(\textbf{x}_N) \right]^{\top}\), \(\textbf{y} = \left[y_1, \ldots, y_N\right]^{\top}\). The parameter prior density \(p(\theta)\) is the product of the prior densities of each parameter, and the GP prior means that the prior for \(\textbf{f}\) is the multivariate normal \[\begin{equation}
p\left(\textbf{f} \mid \theta\right) = \mathcal{N}\left(\textbf{f} \mid \textbf{0}, \textbf{K} \right),
\end{equation}\] where the \(N \times N\) matrix \(\textbf{K}\) has entries \(\{ \textbf{K} \}_{in} = k(\textbf{x}_i, \textbf{x}_n \mid \theta_{\text{GP}})\).
The below table shows which parts of the above mathematical description are affected by which arguments to lgp()
or create_model()
. You can read more about them in the documentation of said functions.
Argument | Affected model part |
---|---|
formula |
\(k(\textbf{x}, \textbf{x}')\) |
data |
\(\mathcal{D}\) |
likelihood |
\(p(y_n \mid f(\textbf{x}_n), \theta_{\text{obs}})\) |
prior |
\(p(\theta)\) |
c_hat |
\(p(y_n \mid f(\textbf{x}_n), \theta_{\text{obs}})\) |
num_trials |
\(\mathcal{D}\) |
options |
\(k(\textbf{x}, \textbf{x}')\) |
likelihood
argument and observation modelsThe terms observation model and likelihood are used to refer to the same formula, i.e. \(p(y_n \mid f(\textbf{x}_n), \theta_{\text{obs}})\), though the former means it as a function of \(\textbf{y}\) and the latter as a function of \(\theta\). There are currently five observation models available and they all involve an inverse link function transformation \[
h_n = g^{-1}\left( f(\textbf{x}_n)+ \hat{c}_n \right)
\] where \(g\) is determined by the likelihood
argument and \(\hat{c}_n\) by the c_hat
argument. The below table shows what the link function is in different cases, and what parameter the corresponding observation model has.
likelihood |
Link function \(g\) | Parameter \(\theta_{\text{obs}}\) |
---|---|---|
gaussian |
identity | \(\sigma\) |
poisson |
logarithm | - |
nb |
logarithm | \(\phi\) |
binomial |
logit | - |
bb |
logit | \(\gamma\) |
In the Gaussian observation model (likelihood="gaussian"
), \[
p(y_n \mid f(\textbf{x}_n), \theta_{\text{obs}}) = \mathcal{N}(y_n \mid h_n, \sigma^2)
\] \(\theta_{\text{obs}}=\sigma\) is a noise magnitude parameter.
The Poisson observation model (likelihood="poisson"
) for count data is \[
y_n \sim \text{Poisson}\left(\lambda_n \right),
\] where the rate is \(\lambda_n = h_n\).
In the negative binomial (likelihood="nb"
) model, \(\lambda_n\) is gamma-distributed with parameters \[
\begin{cases}
\text{shape} &= \phi \\
\text{scale} &= \frac{\phi}{h_n}
\end{cases},
\] and \(\phi > 0\) controls overdispersion so that \(\phi \rightarrow \infty\) corresponds to the Poisson model.
When selecting the binomial or beta-binomial observation model for count data, the number of trials \(\eta_n\), for each \(n=1, \ldots, N\) has to be supplied using the num_trials
argument. The binomial model (likelihood="binomial"
) is \[
y_n \sim \text{Binomial}(h_n, \eta_n),
\] where the success probability \(\rho_n = h_n\).
In the beta-binomial model (likelihood="bb"
), \(\rho_i\) is random so that \[
\rho_n \sim \text{Beta}\left(h_n \cdot \frac{1 - \gamma}{\gamma}, \ (1-h_n) \cdot \frac{1 - \gamma}{\gamma}\right),
\] and the parameter \(\gamma \in [0, 1]\) controls overdispersion so that \(\gamma \rightarrow 0\) corresponds to the binomial model.
When using the Gaussian observation model with sample_f=TRUE
the continuous response \(y\) is normalized to unit variance and zero mean, and \(\hat{c}_n = 0\) for all \(n\) is set. In this case the c_hat
argument has no effect. With sample_f = TRUE
, sensible defaults are used. See the documentation of the c_hat
argument of lgp()
for exact details and the 5. Model inference section for information about the sample_f
argument.
formula
argument and kernel functionsThe GP models of lgpr
are additive, so that \[\begin{equation}
k(\textbf{x}, \textbf{x}' \mid \theta_{\text{GP}}) = \sum_{j=1}^J \alpha_j^2 k_j(\textbf{x}, \textbf{x}' \mid \theta_{\text{GP}}).
\end{equation}\] This is equivalent to saying that we have \(f = f^{(1)} + \ldots + f^{(J)}\) modeled so that each component \(j = 1, \ldots, J\) has a GP prior \[\begin{equation}
f^{(j)} \sim \mathcal{GP}\left(0, \alpha_j^2 k_j(\textbf{x}, \textbf{x}' \mid \theta_{\text{GP}}) \right),
\end{equation}\] independently from other components.
The number of components \(J\) is equal to the number of terms in your formula
. Terms are separated by a plus sign. An example formula with three terms could be
y ~ gp(age) + gp(age)*zs(id) + categ(group)
where y
, age
, id
and group
have to be columns of data
. Each formula term defines what the corresponding kernel \(k_j\) will be like, and what covariates and parameters it depends on. Each term adds one \(\alpha\) parameter in the GP parameter vector \(\theta_{\text{GP}}\), and possible additional parameters depending on the term.
Each term is a product (separated by *
) of what we call expressions. At this point we are not using standard R formula terminology because terms in lgpr
are parsed in a custom way. Each expression corresponds to one kernel, and the kernel \(k_j\) is the product of all the kernels in term \(j\). Inside parentheses, each expression must contain the name of one data
variable, as in gp(age)
. This determines what variable the kernel depends on. Most of the allowed expressions, their corresponding kernels, and allowed variable types are listed below.
Expression | Corresponding kernel | Allowed variable type |
---|---|---|
gp() |
Exponentiated quadratic (EQ) | Continuous |
zs() |
Zero-sum (ZS) | Categorical |
categ() |
Categorical (CAT) | Categorical |
gp_ns() |
Nonstationary (NS) | Continuous |
gp_vm() |
Variance-mask (VM) | Continuous |
Continuous covariates should be represented in data
as numeric
and categorical covariates as factor
s. Equations for different kernels are listed here briefly. See Timonen et al. (2021) for more motivation and details about what kind of effects they can model alone and in combinations.
The EQ kernel is \[ k_{\text{EQ}}(x,x' \mid \theta_{\text{GP}}) = \exp \left( -\frac{(x-x')^2}{2 \ell^2}\right) \] and it has the lengthscale parameter \(\ell\). Each EQ kernel adds one lengthscale parameter to \(\theta_{\text{GP}}\).
The ZS kernel is \[\begin{equation} k_{\text{ZS}}(z, z') = \begin{cases} 1 \ \ \ \text{ if } z = z' \\ \frac{1}{1 - M} \ \ \ \text{ if } z \neq z' \end{cases} \end{equation}\] where \(M\) is the number of different categories for covariate \(z\).
The CAT kernel is \[\begin{equation} k_{\text{CAT}}(z, z') = \begin{cases} 1 \ \ \ \text{ if } z = z' \\ 0 \ \ \ \text{ if } z \neq z' \end{cases} \end{equation}\]
The NS kernel is \[\begin{equation} k_{\text{NS}}(x, x' \mid a, \ell) = k_{\text{EQ}}(\omega_a(x), \omega_a(x') \mid \ell), \end{equation}\] where \(\omega_a: \mathbb{R} \rightarrow ]-1,1[\) is an input warping function \[\begin{equation} \omega_a(x) = 2 \cdot \left(\frac{1}{1 + e^{-a x}} - \frac{1}{2} \right), \end{equation}\] Each NS kernel adds one lengthscale parameter \(\ell\) and one warping steepness parameter \(a\) to \(\theta_{\text{GP}}\).
The VM kernel is \[\begin{equation}
k_{\text{VM}}(x, x' \mid a, \ell) = f^a_{\text{VM}}(x) \cdot f^a_{\text{VM}}(x') \cdot k_{\text{NS}}(x, x' \mid a, \ell),
\end{equation}\] where \(f^a_{\text{VM}}(x) = \frac{1}{1 + e^{-a h_2 (x-r)}}\) and \(r = \frac{1}{a} \text{logit}(h_1)\). The parameters \(h_1\) and \(h_2\) are determined by opt$vm_params[1]
and opt$vm_params[2]
, respectively, where opt
is the options
argument given to lgp()
. Each VM kernel adds one lengthscale parameter \(\ell\) and one warping steepness parameter \(a\) to \(\theta_{\text{GP}}\).
All kernels that work with continuous covariates are actually also multiplied by a binary mask (BIN) kernel \(k_{\text{BIN}}(x,x')\) which returns \(0\) if either \(x\) or \(x'\) is missing and \(1\) if they are both available. Missing data should be encoded as NA
or NaN
in data
.
There are also the het()
and unc()
expressions. They cannot be alone in a term but have to be multiplied by EQ, NS or VM. They are not actually kernels alone but edit the covariate or kernel of their term and add additional parameters. See the tutorials for example use cases and Timonen et al. (2021) for their mathematical definition.
After the model is defined, lgpr
uses the MCMC methods of Stan to obtain draws from the joint posterior \(p\left(\theta, \textbf{f} \mid \mathcal{D}\right)\) or the marginal posterior of parameters, i.e. \(p\left(\theta \mid \mathcal{D}\right)\). Which one of these is done is determined by the sample_f
argument of lgp()
or create_model()
.
This option is always possible but not recommended with likelihood = "gaussian"
. The joint posterior that is sampled from is \[\begin{equation}
p\left(\theta, \textbf{f} \mid \mathcal{D}\right) \propto p\left(\theta, \textbf{f}\right) \cdot p(\textbf{y} \mid \textbf{f}, \theta) \\
\end{equation}\] and sampling requires evaluating the right-hand side and its gradient thousands of times.
This option is only possible (and is automatically selected by default) if likelihood = "gaussian"
. This is because \[\begin{equation}
p\left(\textbf{y} \mid \theta\right) = \mathcal{N}\left(\textbf{y} \mid \textbf{0}, \textbf{K} + \sigma^2 \textbf{I} \right)
\end{equation}\] is analytically available only in this case. The distribution that is sampled from is \[\begin{equation}
p\left(\theta \mid \mathcal{D}\right) \propto p\left(\theta\right) \cdot p(\textbf{y} \mid \theta) \\
\end{equation}\] and now sampling requires repeatedly evaluating the right-hand side of this equation and its gradient. This analytical marginalization reduces the MCMC dimension by \(N\) and likely improves sampling efficiency. The conditional posterior \(p\left(\textbf{f} \mid \theta, \mathcal{D}\right)\) also has an analytical form for a fixed \(\theta\), so draws from the marginal posterior \(p\left(\textbf{f} \mid \mathcal{D}\right)\) could be obtained by first drawing \(\theta\) and then \(\textbf{f}\), according to the process \[\begin{align}
\theta &\sim p\left(\theta \mid \mathcal{D}\right) \\
\textbf{f} & \sim p\left(\textbf{f} \mid \theta, \mathcal{D}\right).
\end{align}\] By combining these, we again have draws from the joint posterior \(p\left(\theta, \textbf{f} \mid \mathcal{D}\right)\), but likely with less computational effort.