Scalable Gaussian Processes: Prior Approximation Methods

ObjectPetitU
3 min read · Oct 21, 2019

This article builds on the previous one, which highlighted techniques for scaling Gaussian Processes. In this post we discuss Prior Approximation Methods.

Within Sparse Approximation methods, we focus on Prior Approximation Methods.

The issue with scaling GPs is their O(n³) complexity, which comes from inverting the n × n kernel matrix. We can overcome this problem by approximating the kernel’s eigenfunctions using only m data points (these m points are known as ‘inducing points’ and are discussed further on), a process known as the Nystrom Approximation.

Nystrom Approximation

K_nn ≈ Q_nn = K_nm K_mm⁻¹ K_mn

In this representation of the Nystrom Approximation, K is the kernel matrix: K_nn is evaluated on all n training points, K_mm on the m inducing points, and K_nm (with K_mn its transpose) between the two sets.
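To make this concrete, here is a minimal NumPy sketch of the Nystrom Approximation. The RBF kernel, data sizes, and variable names are illustrative choices of mine, not from the original post.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel between the row vectors of A and B."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2 * A @ B.T)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(1000, 1))                  # full data set (n = 1000)
Xm = X[rng.choice(len(X), size=20, replace=False)]      # m = 20 inducing points

K_nm = rbf_kernel(X, Xm)                                # n x m cross-covariance
K_mm = rbf_kernel(Xm, Xm) + 1e-8 * np.eye(len(Xm))      # m x m, jittered for stability

# Nystrom approximation: K_nn ~= Q_nn = K_nm K_mm^{-1} K_mn
Q_nn = K_nm @ np.linalg.solve(K_mm, K_nm.T)

K_nn = rbf_kernel(X, X)
print("max abs error:", np.abs(K_nn - Q_nn).max())
```

Q_nn is a rank-m approximation of the full kernel matrix, and how good it is depends on how well the inducing points cover the input space.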

How can this approximation be utilised to scale GPs?

The Nystrom Approximation allows us to build a generative probabilistic model using the m inducing points to summarise the data set. We consider:

(Xm, fm), a set of inducing pairs: the inducing inputs Xm and the corresponding inducing variables fm. We use the fact that fm is a sufficient statistic for f. That means that for any variable z, p(z|f,fm) = p(z|fm), i.e. fm accounts for all of the information in f.

Prior approximation methods modify the joint prior over f and f*, and we can then apply the Nystrom Approximation to the new prior.

In order to approximate the prior and reduce the complexity, we assume that f and f* are conditionally independent given fm. This allows us to marginalise out fm, leaving the following approximate joint prior. [1]

q(f, f*) = ∫ q(f | fm) q(f* | fm) p(fm) dfm

Our new train and test conditionals become:

q(f | fm) = N(K_nm K_mm⁻¹ fm, Qtilda_nn)
q(f* | fm) = N(K_*m K_mm⁻¹ fm, Qtilda_**)

These are the new training and test conditionals. For comparison, the exact conditionals have covariances K_nn − Q_nn and K_** − Q_**, where Q_ab = K_am K_mm⁻¹ K_mb.

Now, we can approximate the standard log p(y) objective function with log q(y).

log q(y) = log N(y | 0, Q_nn + Qtilda_nn + σ²I)

This is the function that will be optimised in place of the exact log marginal likelihood.

If we carefully select Qtilda_nn (for example, making it zero or diagonal), the matrix inversion lemma lets us work with Q_nn + Qtilda_nn + σ²I at a cost of only O(nm²).
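As a rough sketch of why this works, the function below evaluates log q(y) using the Woodbury identity and the matrix determinant lemma, so that only m × m systems are ever factorised. The function name, argument names, and the assumption that Qtilda_nn is diagonal (or zero) are mine.

```python
import numpy as np

def log_q_y(y, K_nm, K_mm, diag_Qtilde_nn, noise_var):
    """log q(y) = log N(y | 0, Q_nn + Qtilde_nn + sigma^2 I), evaluated in
    O(n m^2), assuming Qtilde_nn is diagonal (given here by its diagonal)."""
    n, m = K_nm.shape
    d = diag_Qtilde_nn + noise_var            # diagonal of D = Qtilde_nn + sigma^2 I
    Kd = K_nm / d[:, None]                    # D^{-1} K_nm, O(nm)
    A = K_mm + K_nm.T @ Kd                    # m x m inner matrix, O(n m^2)
    L = np.linalg.cholesky(A)
    Lm = np.linalg.cholesky(K_mm + 1e-8 * np.eye(m))

    # Quadratic term y^T Sigma^{-1} y via the Woodbury identity
    Kdy = Kd.T @ y                            # m-vector K_mn D^{-1} y
    alpha = np.linalg.solve(L, Kdy)
    quad = y @ (y / d) - alpha @ alpha

    # log|Sigma| via the matrix determinant lemma
    logdet = (2 * np.log(np.diag(L)).sum()
              - 2 * np.log(np.diag(Lm)).sum()
              + np.log(d).sum())

    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)
```

For SoR/DIC and DTC you would pass a zero vector for diag_Qtilde_nn; FITC passes diag(K_nn − Q_nn), as shown in the snippet further below.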

How do we select Q?

**In the heading above, Q is shorthand for the Qtilda_nn term defined earlier.

The Subset of Regressors method, also known as the Deterministic Inducing Conditional (DIC), takes Qtilda_nn = 0 and Qtilda_** = 0.

The Deterministic Training Conditional (DTC) imposes Qtilda_nn = 0 on the training conditional, but keeps the exact test conditional. [2]

The Fully Independent Training Conditional (FITC) imposes a different training conditional, Qtilda_nn = diag(K_nn − Q_nn), which factorises fully across the training points, and keeps the exact test conditional.
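Continuing the earlier sketches, the different methods then amount to different choices of Qtilda_nn plugged into the same log q(y) routine. The toy targets y and the noise level here are illustrative.

```python
# Continuing from the Nystrom sketch: noisy observations of a latent function.
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(len(X))
noise_var = 0.1**2

# SoR/DIC and DTC use Qtilda_nn = 0 (they differ only in the *test* conditional).
print("DTC :", log_q_y(y, K_nm, K_mm, np.zeros(len(X)), noise_var))

# FITC uses Qtilda_nn = diag(K_nn - Q_nn), a fully independent training conditional.
print("FITC:", log_q_y(y, K_nm, K_mm, np.diag(K_nn - Q_nn), noise_var))
```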

Inducing Points

While the methods above assume a set of m inducing points is given, the locations of these points can themselves be treated as hyperparameters. So although the complexity has been reduced to O(nm²), we are still left with suboptimal parameters that need optimising, which adds to the overall cost.

In the figure, the original data set appears as black dots and the inducing points as black crosses along the bottom of the plot. We can see that adding inducing points reduces the predictive variance (shown in grey).
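As a hedged sketch of what treating the inducing inputs as free parameters can look like, one option is to maximise the approximate log marginal likelihood over their locations with a generic optimiser. This continues the snippets above; the use of SciPy’s L-BFGS-B and the DTC-style objective are my own illustrative choices.

```python
from scipy.optimize import minimize
import numpy as np

def neg_log_q_y(xm_flat, X, y, noise_var, m):
    """Negative DTC-style log q(y) as a function of the inducing input locations."""
    Xm = xm_flat.reshape(m, -1)
    K_nm = rbf_kernel(X, Xm)
    K_mm = rbf_kernel(Xm, Xm) + 1e-8 * np.eye(m)
    return -log_q_y(y, K_nm, K_mm, np.zeros(len(X)), noise_var)

m = 20
x0 = Xm.flatten()   # initialise at the randomly chosen inducing points from earlier
res = minimize(neg_log_q_y, x0, args=(X, y, noise_var, m), method="L-BFGS-B")
Xm_opt = res.x.reshape(m, -1)
```

In practice, GP libraries optimise the inducing locations jointly with the kernel hyperparameters using analytic gradients rather than the finite differences used here.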

In Posterior Approximation, we look at a group of methods that simultaneously optimise the objective function and the inducing points.

Bibliography

[1] J. Quiñonero-Candela and C. E. Rasmussen, A Unifying View of Sparse Approximate Gaussian Process Regression

[2] A. J. Smola and P. L. Bartlett, Sparse Greedy Gaussian Process Regression
