Scalable Gaussian Processes: An Introduction

ObjectPetitU
Oct 21, 2019

There are plenty of posts on the web about Gaussian Processes (GPs), and this is another one of them; however, we focus on methods that aim to alleviate the practical issues surrounding GPs.

The post is constructed as follows:

  1. Overview of Gaussian Processes and how we get there.
  2. Scalable Gaussian Processes. We split this section into the following:
  • Global Approximation Methods
  • Local Approximation Methods
  3. Implementation and GPU adaptation

Overview

We will briefly highlight the main ideas behind GPs so that we can build on them.

If we consider regular linear regression, the aim is to find the optimum weights that will minimise some loss function, usually the mean squared error.

Equation for linear regression: y = Xβ + ε

Here, y represents the output and X the input, with β (beta) representing the coefficients and ε (epsilon) the error.

As the linear model uses point-estimate (scalar) coefficients, the next step up in complexity is to treat the coefficients as random variables drawn from a distribution (usually a normal distribution); this is Bayesian regression.

Equation for Bayesian Regression
Bayesian vs Linear regression

In the above graph, it can be clearly seen that while linear regression (red) produces a single point-estimate line, the line produced by Bayesian regression (dark blue, with lighter blue for standard deviations) follows a (normal) distribution.

We can expand further on Bayesian regression and say that any finite collection of those data points follows a multivariate Gaussian distribution. Thus, the distribution of the GP is the joint distribution of all these random variables, and from this we get the common definition of a 'distribution over functions'.

Gaussian Process: f(x) ~ GP(m(x), k(x, x'))

The function f is a Gaussian process, defined by a mean function m(x) and a kernel k(x, x'). The kernel is usually the 'squared exponential'; however, many others exist and can be utilised depending on the requirements.

From left to right: Squared Exponential, Brownian Motion, Quadratic
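
To make this concrete, here is a minimal NumPy sketch (one-dimensional inputs, with helper names of my own choosing) of the squared-exponential kernel and a few draws from the resulting GP prior, i.e. the 'distribution over functions' mentioned above:

```python
import numpy as np

def squared_exponential(x1, x2, lengthscale=1.0, variance=1.0):
    # k(x, x') = s^2 * exp(-(x - x')^2 / (2 * l^2))
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

# Any finite set of inputs yields a multivariate Gaussian whose covariance
# is given by the kernel -- sampling from it draws whole functions.
x = np.linspace(-5, 5, 200)
K = squared_exponential(x, x) + 1e-8 * np.eye(len(x))   # small jitter for numerical stability
prior_samples = np.random.multivariate_normal(np.zeros(len(x)), K, size=3)
```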

One of the main operations GPs rely on for inference is marginalising a joint Gaussian distribution.
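
For reference, for a zero-mean GP the joint distribution over the training outputs f and the test outputs f*, and the predictive distribution obtained by marginalising out f, take the standard form:

$$
\begin{bmatrix} \mathbf{f} \\ \mathbf{f}_* \end{bmatrix}
\sim \mathcal{N}\!\left(\mathbf{0},
\begin{bmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right)
$$

$$
\mathbf{f}_* \mid X_*, X, \mathbf{f}
\sim \mathcal{N}\!\big(K(X_*, X)\,K(X, X)^{-1}\mathbf{f},\;
K(X_*, X_*) - K(X_*, X)\,K(X, X)^{-1}K(X, X_*)\big)
$$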

From the above equations, we can see that f* is the function at the new (test) inputs, and we can use the previous f (the training data that the GP was fit on).

We can see that inferring values at new inputs from the previously trained set requires us to marginalise f out of the joint distribution.

The graph below shows a simple representation of a GP. At the observed data points, the variance is very low, while in between points (grey area) we can clearly see larger variance.

We can see here the predicted mean in red and the 2-sigma range in grey. Close to the observed data points, the variance is extremely low.

GPs are a valuable modelling tool as they allow us to develop models that output predictions with variances. This allows us to manage uncertainty far better than point estimates do.

Gaussian Processes are also not limited to regression problems; they can be utilised for classification as well.

Gaussian Process utilised for classification

A really good introductory discussion of GPs: https://katbailey.github.io/post/gaussian-processes-for-dummies/. For the keener reader, Gaussian Processes for Machine Learning by Rasmussen is available to download for free: http://www.gaussianprocess.org/gpml/.

Utilising Gaussian Processes

Like all models, we first have to train the system and then use it to infer. Without getting into too much detail, there is an objective function, log p(y), which needs to be optimised. In order to solve this system, we need to invert the kernel matrix. This inversion, typically done via a Cholesky decomposition, is O(n³), and thus utilising GPs for big data quickly becomes intractable.
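
To make the cost concrete, below is a minimal sketch (plain NumPy/SciPy, function name of my own choosing) of the exact objective log p(y) computed via a Cholesky factorisation of the kernel matrix; the factorisation is the O(n³) step.

```python
import numpy as np
from scipy.linalg import cholesky, cho_solve

def gp_log_marginal_likelihood(K, y, noise_var=1e-2):
    """Exact GP objective log p(y) for an n x n kernel matrix K and targets y."""
    n = len(y)
    L = cholesky(K + noise_var * np.eye(n), lower=True)  # O(n^3): the bottleneck
    alpha = cho_solve((L, True), y)                      # solves (K + noise * I) alpha = y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))                 # -0.5 * log|K + noise * I|
            - 0.5 * n * np.log(2 * np.pi))
```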

How can we get around this problem?

Comparing complexity between local and global approximation methods

There are two main approaches: conducting the approximation at a global level, or approximating at a local level. The chart above shows the differences in complexity that we can look to achieve when using a particular approximation method.

The following work will look at some techniques which can be used to overcome this problem. We will be closely following the work of [1] and [2].

Scalable Gaussian Processes

In the following, we outline algorithms that aim to approximate GPs in one way or another, as well as the hardware improvements that have allowed GPs to be utilised in big-data scenarios.

A breakdown of methods for scalable GPs

For more detail, check out Gaussian Processes for Machine Learning by Rasmussen.

Global Approximation

Within global approximation, there are three methods:

Subset of Data

As the name suggests, you only train the model on a subset of the data. You can go from O(n³) down to O(m³), where m << n is the size of the subset. Naturally, a subset of the data has the potential to represent global features, but not local ones. In terms of selecting the subset, we can use some simple heuristics (a short sketch follows the list):

  • Randomly sample the data set.
  • Conduct k-means on the data set and select the centroids [3].
  • Active learning methods: these can include differential entropy, information gain and matching pursuit (at a higher computational cost).
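
Here is a minimal sketch of the first two heuristics (helper names are my own), assuming X is an (n, d) NumPy array of inputs and y the corresponding targets; the m selected points are then used to train a standard exact GP.

```python
import numpy as np
from sklearn.cluster import KMeans

def random_subset(X, y, m, seed=0):
    # Heuristic 1: uniformly sample m of the n training points.
    idx = np.random.default_rng(seed).choice(len(X), size=m, replace=False)
    return X[idx], y[idx]

def kmeans_subset(X, y, m, seed=0):
    # Heuristic 2: run k-means, then keep the data point nearest to each centroid.
    centroids = KMeans(n_clusters=m, n_init=10, random_state=seed).fit(X).cluster_centers_
    idx = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1).argmin(axis=0)
    return X[idx], y[idx]
```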

Sparse kernels

Kernels are a key component of GPs. The aim here is to represent the kernel with a compactly supported (CS) kernel, which imposes that k(xi, xj) = 0 whenever |xi - xj| exceeds a threshold. With this approach, we can reduce the complexity to O(αn³), where 0 < α < 1 [4].
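
As one concrete example (the post does not prescribe a specific CS kernel), the Wendland kernel below is exactly zero beyond a chosen threshold, which makes the kernel matrix sparse, and it remains positive definite for inputs of up to three dimensions. A one-dimensional sketch:

```python
import numpy as np

def wendland_kernel(x1, x2, threshold=1.0, variance=1.0):
    # Compactly supported: exactly zero once |x - x'| exceeds `threshold`.
    r = np.abs(x1[:, None] - x2[None, :]) / threshold
    return variance * np.clip(1.0 - r, 0.0, None) ** 4 * (4.0 * r + 1.0)
```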

Sparse Approximations

The aim here is to approximate the full-rank kernel matrix, with the traditional path being to conduct an eigenvalue decomposition. Unfortunately, this is itself O(n³) and thus not beneficial in this scenario.

Instead, we can approximate the eigenvalues with the Nyström approximation, which utilises only m data points. Building on the Nyström approximation, we can then construct sparse approximations.
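
A minimal sketch of the Nyström approximation, assuming a `kernel(A, B)` function (for example the squared exponential defined earlier) and m chosen points Z, such as a random subset of X:

```python
import numpy as np

def nystrom_approximation(kernel, X, Z, jitter=1e-8):
    # K ~= K_nm @ K_mm^{-1} @ K_mn, built from m inducing points Z (m << n).
    K_mm = kernel(Z, Z) + jitter * np.eye(len(Z))   # m x m
    K_nm = kernel(X, Z)                             # n x m
    return K_nm @ np.linalg.solve(K_mm, K_nm.T)     # rank-m approximation of the n x n kernel
```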

There are three main categories of sparse approximations:

  1. Prior approximations, which approximate the prior but perform exact inference. Methods here include the Subset of Regressors (SoR)/DIC, DTC and FITC.
  2. Posterior approximations, which retain the exact prior but perform approximate inference. Methods here include Variational Free Energy (VFE) methods and FITC.
  3. Structured sparse approximations, which exploit specific structures in the kernel matrix.

Local Approximation

  1. Naive Local Experts
  2. Mixture-of-Experts
  • Product-of-Experts

While we have highlighted several methods that can be utilised, the area of active research that has shown scalability with reliable and robust results is posterior approximation methods.

Implementation and GPU adaptation

As mentioned above, global approximation methods, particularly sparse approximation methods, allow us to reduce the complexity from O(n³) to O(nm²). However, in real-time scenarios, GPs are still not the algorithm of choice.

GPflow, written on top of TensorFlow, and GPyTorch, written on top of PyTorch, have been developed to take advantage of GPU hardware.
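
As a rough illustration (a minimal sketch following the pattern of GPyTorch's documented regression example, not a tuned implementation), an exact GP model can be defined and moved onto a GPU like this:

```python
import torch
import gpytorch

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

# Toy data for illustration.
train_x = torch.linspace(0, 1, 1000)
train_y = torch.sin(train_x * 6.0) + 0.1 * torch.randn(train_x.size(0))

# The GPU adaptation: move data and model onto the GPU if one is available.
if torch.cuda.is_available():
    train_x, train_y = train_x.cuda(), train_y.cuda()

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)

if torch.cuda.is_available():
    model, likelihood = model.cuda(), likelihood.cuda()
```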

Next Up

This is the first in a series of posts aiming to highlight the different approaches to making Gaussian Processes usable in the era of Big Data.

Part 2: Global Approximation — Prior Approximation methods. In this article, we look at global approximation and discuss prior approximation methods in detail.

Part 3: Global Approximation — Posterior Approximation methods. In this article, we look at global approximation and discuss posterior approximation methods in detail. These have been a hot topic of research in recent times and have shown great results.

Part 4: Testing various techniques. In this post, we look at some of the main packages for scalable GPs in Python, test them on various data sets and assess the results.

Bibliography

[1] H. Liu, Y. Ong, X. Shen, J. Cai, When Gaussian Process Meets Big Data: A Review of Scalable GPs

[2] J. Hensman, N. Fusi, N. D. Lawrence, Gaussian Processes for Big Data

[3] F. P. Preparata, M. I. Shamos, Computational Geometry: An Introduction

[4] L. Csató, M. Opper, Sparse Representation for Gaussian Process Models
