Scalable Gaussian Processes: Prior Approximation Methods
This article builds on the previous one, which surveyed the techniques available for scaling Gaussian Processes. In this post we discuss Prior Approximation Methods.
The issue with scaling GPs is their O(n³) complexity, which comes from inverting the n×n kernel matrix. We can sidestep this by approximating the kernel's eigenfunctions using only m data points (these m points are known as 'inducing points' and will be discussed further on), a process known as the Nyström approximation.
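Following the standard formulation, the Nyström approximation replaces the full kernel matrix with a low-rank surrogate built from those m points:

K_nn ≈ K_nm K_mm^{-1} K_mn =: Q_nn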
In the above representation of the Nyström approximation, K is the kernel matrix: K_nn is evaluated on the n training points, K_mm on the m inducing points, and K_nm (the transpose of K_mn) between the two sets.
How can this approximation be utilised to scale GPs?
The Nyström approximation allows us to build a generative probabilistic model in which the m inducing points summarise the data set. We consider:
(X_m, f_m), a set of inducing pairs: inducing inputs X_m and the corresponding latent function values f_m. We use the fact that f_m is a sufficient statistic for f, meaning that for any variable z, p(z | f, f_m) = p(z | f_m), i.e. f_m accounts for all the information carried by f.
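Concretely (following [1]), augmenting the prior with f_m and integrating it back out leaves the exact joint prior over the training latents f and the test latents f_* unchanged:

p(f, f_*) = ∫ p(f, f_* | f_m) p(f_m) df_m,  with  p(f_m) = N(0, K_mm)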
The prior approximations modify this joint prior, and we can then apply the Nyström approximation to the new prior.
In order to approximate the prior and reduce the complexity, we assume that f and f_* are conditionally independent given f_m. This allows us to marginalise out f_m, leaving the following approximate joint prior. [1]
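In the notation of [1]:

q(f, f_*) = ∫ q(f | f_m) q(f_* | f_m) p(f_m) df_m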
Our new train and test conditionals become:
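Writing Q_ab := K_am K_mm^{-1} K_mb, the conditionals take the general form (as in [1]):

q(f | f_m) = N(K_nm K_mm^{-1} f_m, Q̃_nn)
q(f_* | f_m) = N(K_*m K_mm^{-1} f_m, Q̃_**)

The exact conditionals are recovered by Q̃_nn = K_nn − Q_nn and Q̃_** = K_** − Q_**; each approximation below corresponds to a cheaper choice of these two terms.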
Now, we can approximate the standard log p(y) objective function with log q(y).
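For a Gaussian likelihood with noise variance σ² (a notational assumption, since the article does not write it out), this is

log q(y) = log N(y | 0, Q_nn + Q̃_nn + σ² I)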
If we carefully select Q̃_nn, then we can reduce the complexity down to O(nm²).
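To see where the O(nm²) comes from: whenever Λ := Q̃_nn + σ² I is diagonal (as it is for every choice below), the matrix inversion lemma gives

(Q_nn + Λ)^{-1} = Λ^{-1} − Λ^{-1} K_nm (K_mm + K_mn Λ^{-1} K_nm)^{-1} K_mn Λ^{-1}

so only an m×m matrix needs factorising, and the dominant cost is the O(nm²) product K_mn Λ^{-1} K_nm.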
How do we select Q?
**I am using Q in the title above as shorthand for the Q̃_nn (and Q̃_**) terms introduced earlier.
The Subset of Regressors method, also known as the Deterministic Inducing Conditional (DIC), takes Q̃_nn = 0 and Q̃_** = 0.
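Both conditionals then become deterministic,

q(f | f_m) = N(K_nm K_mm^{-1} f_m, 0),  q(f_* | f_m) = N(K_*m K_mm^{-1} f_m, 0),

so the effective prior covariance is Q everywhere: a degenerate GP with at most m degrees of freedom.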
The Deterministic Training Conditional (DTC) imposes Q̃_nn = 0 on the training conditional, but keeps the exact test conditional Q̃_** = K_** − Q_**. [2]
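As a concrete illustration, here is a minimal numpy sketch (not code from the article; the RBF kernel, noise level and function names are assumptions) of the SoR/DTC predictive mean and the SoR predictive covariance, computed so that only an m×m system is ever solved:

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between the rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def sor_dtc_predict(X, y, Xm, Xs, sigma2=0.1, jitter=1e-6):
    """SoR/DTC predictive mean (and SoR covariance) at test inputs Xs,
    using m inducing inputs Xm; the only matrix factorised is m x m."""
    m = Xm.shape[0]
    Kmm = rbf(Xm, Xm) + jitter * np.eye(m)   # m x m
    Kmn = rbf(Xm, X)                          # m x n
    Kms = rbf(Xm, Xs)                         # m x n_test
    # Matrix inversion lemma in disguise: A is m x m, so the overall cost
    # is dominated by forming Kmn @ Kmn.T, i.e. O(n m^2).
    A = sigma2 * Kmm + Kmn @ Kmn.T
    mean = Kms.T @ np.linalg.solve(A, Kmn @ y)
    cov = sigma2 * Kms.T @ np.linalg.solve(A, Kms)   # SoR covariance
    return mean, cov

# toy usage
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(500)
Xm = np.linspace(-3, 3, 20)[:, None]   # m = 20 inducing inputs
Xs = np.linspace(-3, 3, 5)[:, None]
mu, cov = sor_dtc_predict(X, y, Xm, Xs)
```

DTC shares this predictive mean; its predictive covariance simply adds back the exact test term K_** − Q_**.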
The Fully Independent Training Conditional (FITC) imposes a different training conditional, as seen below, and keeps the exact test conditional.
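FITC keeps only the diagonal of the exact training conditional covariance (again in the notation of [1]):

Q̃_nn = diag[K_nn − Q_nn],  i.e.  q(f | f_m) = N(K_nm K_mm^{-1} f_m, diag[K_nn − Q_nn])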
Inducing Points
While we have discussed methods that use m inducing points, as described above, the locations of those points can themselves be treated as hyperparameters. So although the complexity has been reduced to O(nm²), we are still left with potentially suboptimal inducing locations that need selecting or optimising, which adds computational cost back in. Two common initialisation heuristics are sketched below.
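Neither heuristic is prescribed by the article; they are simply common starting points that a later optimisation step can refine:

```python
import numpy as np
from sklearn.cluster import KMeans

def inducing_random_subset(X, m, seed=0):
    """Pick m training inputs uniformly at random as inducing inputs."""
    idx = np.random.default_rng(seed).choice(len(X), size=m, replace=False)
    return X[idx]

def inducing_kmeans(X, m, seed=0):
    """Use k-means cluster centres of the training inputs as inducing inputs."""
    return KMeans(n_clusters=m, n_init=10, random_state=seed).fit(X).cluster_centers_
```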
In Posterior Approximation, we will look at a group of methods that optimise the objective function and the inducing point locations simultaneously.
Bibliography
[1] J. Quiñonero-Candela, C. E. Rasmussen, "A Unifying View of Sparse Approximate Gaussian Process Regression"
[2] A. J. Smola, P. L. Bartlett, "Sparse Greedy Gaussian Process Regression"