# Latent Gaussian Process Regression

###### Abstract

We introduce Latent Gaussian Process Regression which is a latent variable extension allowing modelling of non-stationary multi-modal processes using GPs. The approach is built on extending the input space of a regression problem with a latent variable that is used to modulate the covariance function over the training data. We show how our approach can be used to model multi-modal and non-stationary processes. We exemplify the approach on a set of synthetic data and provide results on real data from motion capture and geostatistics.

references.bib

## Introduction

Gaussian processes (GPs) are probabilistic objects that can be employed as priors to specify distributions over spaces of functions. This provides models with principled uncertainty specification and allows for Bayesian regularization to balance model complexity with model fit. The flexibility of GPs stems from their non-parametric structure where the characteristics of the prior is fully encapsulated in the choice of covariance and mean function. In all but few cases, the mean function is set to be constant leaving only the covariance function to be chosen.

Most covariance functions are stationary which means that there is a single structure of variations independent of location in the input space. However, for many types of data, the assumption of a stationary process is not suitable making non-stationary covariances desirable. Creating such covariances often leads to an explosion in the number of parameters, effectively removing the benefit of a non-parametric model.

An additional challenge with GPs is that they are limited to modelling a single function. Often we have data where, in certain parts of the input space, the data has been generated by several different functions. In such scenarios we desire a model that switches automatically between functions allowing the data to be represented by several different processes.

In this paper we present a unified framework that tackles both these problems. It allows modelling structures, such as non-stationary and multi-modal functions, using GPs without an explosion in the number of parameters. Specifically, we extend any covariance function with an additional latent space that encapsulates this structure in a non-parametric manner leading to a single GP with a specific covariance function.

During inference, we marginalise out these latent variables from the model using the variational approach of [Titsias:2010tb]. This method depends on computing expectations over the covariance function of the GP. This is only analytically tractable for a subset of covariance functions, limiting the applicability of the approach. This motivates the second contribution of this paper. We show that these expectations can be approximated efficiently using Monte Carlo methods, yielding otherwise intractable covariances (such as ours) tractable. It is also beneficial compared to when the expectations need to be analytically tractable as it allows for rapid prototyping by removing challenging and time consuming derivations.

## Background

An attractive property of Gaussian processes is that, through very simple means, it is possible to formulate priors that are both interpretable and expressive. Examples are covariance functions, such as the squared exponential, which with a single parameter encode a global smoothness structure. However, for many types of data these global assumptions are not valid. There has been significant interest in how to describe non-stationary covariances allowing for either changing function behaviour, as in [DBLP:conf/icml/AdamsS08], or for hetroscedastic noise, as in [Lazaro-GredillaT11].

In multi-task learning, covariance functions have been defined that are able to model variations between output dimensions such as [Alvarez:2008wy]. As positive-semidefinite kernels are closed under several different operations, there has also been work on how to combine covariances to generate more expressive models. In [2011:Duvenaud], the authors present an approach where a class of additive covariances is described. In a continuation of this work, the choice of covariance function was formulated as a search problem [DuvLloGroetal13] where a set of base covariances could be combined using additions and multiplications. Using these techniques it is possible to create far more expressive priors while still retaining the benefits of the GP framework.

When moving from modelling stationary to non-stationary covariances, the prior assumption changes from that of a global structure to one of input dependent, local structures. When there are multiple global and/or local trends, these may be modelled by operations on the covariances; for example, as being generated from a sum of globally varying processes in [liutkus2011gaussian] or as a sum of (potentially infinitely many) local experts in [rasmussen2002infinite]. However, all of these models use a single generating mapping from input space to output space. When there are multiple processes generating the data independently, the observations are still forced to be explained in terms of their covariances with respect to all other observations based on their position in the input space. When the underlying data generating processes give rise to multiple modes in the output space, at a single location in the input space, a single-process GP model must resort to explaining the data as noise - failinlg to capture the density of the data and failing to generalize from it. The strength of the GP to model smooth functions is not being utilized in this case as the covariances within each independent partition of the observations, produced by their respective underlying data generating process, cancel each other out.

In [lazaro2012overlapping] the authors present an approach to model overlapping GPs, exemplified as a method for e.g. multi-target tracking scenarios and modelling of heteroscedasticity. In their approach, they model data-association of the observations to independent GPs via a latent association matrix. However, since the structure of the latent subspaces created by the association matrix is not explicitly modelled the approach only allows modelling of structures where there is no interdependency between groups. The method we propose in this paper includes this approach as a special case.

The creation of more complicated covariance structures presents a particular challenge during inference. GPs, in their simplest form, scale cubicly with the data which has lead to a significant amount of work on reducing this computational complexity. In [DBLP:conf/icml/DeisenrothN15] a factorised approach is presented which represents the GP as a product-of-experts that allows for massively distributed computations. However, creating a factorised model such as this relies on making independence assumptions that are not necessarily straight forward or are restrictive on modelling power.

Another approach is to sparsify the GP and use a smaller set of points, referred to as inducing points, to approximate the full model [DBLP:journals/jmlr/CandelaR05]. This then presents a challenge on how to select the inducing points. In [Titsias:2009vf] a variational approach was presented that learns these inducing points by viewing them as variational parameters specifying the bound. Using the same approach, the authors showed that an extension of the same idea applies to unsupervised learning with GPs [Titsias:2010tb] facilitating approximative integration of the latent variables. The downside of the variational approach presented in [Titsias:2009vf, Titsias:2010tb] is that it requires calculations of expectations over the covariance function. These calculations cumbersome and sometimes intractable which limits scope of applicable covariance functions.

In this paper, we describe a simple extension to any covariance function that allows modelling of non-stationary multi-modal behaviour. Our formulation can model both non-stationary functions, i.e. when the behaviour of the function is different in different parts of the input domain, and also multi-modal functions where a single input location can be associated with several different outputs. Our approach is based on combining any covariance function with an additional covariance over a latent input space. Using this approach we create a non-parametric model for non-stationary and multi-modal data. We approximately integrate out the latent space using the variational approach in [Titsias:2010tb]. We show empirically that it is possible to approximate the challenging expectations using efficient sampling. This makes our approach applicable to any type of covariance independent of whether or not the expectations can be computed in closed form.

## Latent Gaussian Process Regression

Consider a set of input output pairs generated from a non-stationary process . We can describe this function as an expansion of processes,

(1) |

where is modelled by a GP. The functions modulate the base functions and encode non-stationary multi-modal behaviour by smoothly segmenting the different functions over the training data. Modelling a sum of functions as a GP is straight forward so the main challenge is how to parametrise . The approach that we take in this paper is to extend the input domain with a latent variable , such that we have

(2) |

Now, rather than directly modulating the output of the function, the latent variable can be used to modulate the covariance function in a non-parametric manner. This allows for modelling of non-functions by differentiating between several outputs at the same input location by altering . This leads to the following latent regression model,

(3) |

where is a GP prior over additive functions. By marginalising out the latent variables and the GP prior we can recover the standard marginalised likelihood for GP regression,

(4) |

An intuition behind this approach is to think of the marginalisation as a projection, where multiple single-modal processes over the extended input space becomes multi-modal when projected onto the subspace of the original data. We will now describe how we achieve this by using a simple class of covariance functions that we will refer to as juxtaposition kernels.

### Juxtaposition Covariance

In order to achieve the desired characteristics described above, we will study covariances that are defined as sums of products of different covariance functions evaluated in both the original and the extended input space as,

(5) |

where is a kernel function over latent inputs , a subset of , and is a kernel function over the observed inputs . This can be viewed as a GP-prior consisting of a weighted sum of different kernels where describes the influence of for the input pair and . In this paper we will evaluate a special case of this kernel that encourages a factorised representation such that the covariance of each pair of observed data points is modelled by either a single kernel function or considered independent. This is achieved by using a linear kernel over the latent space,

(6) |

and constraining the latent locations to reside in the corners of a simplex in the unified latent space such that . In this special case, the compound kernel (5) can be interpreted as a continuous convex association of each observation to a respective component kernel. To achieve this, a transformation is deployed as,

(7) |

where encodes the strength of the discritisation. Since is linear and is constrained to a simplex it is sufficient for it to be one-dimensional, denoted . To remove effects of initialisation, we use a simple annealing scheme to set starting with a small value that increases each iteration of the optimisation. The motivation behind this is that with a small value, the associations can easily be altered while it is associated with a significantly higher “cost” for large alphas. This yields a factorising variant of the juxtaposition kernel, summarized as,

(8) |

Using the factorisation above we aim to address the same problem formulation as in [lazaro2012overlapping] however our method is capable of generalising to any structures over the latent space which can be encoded using different latent space priors and is not limited to the linear kernel as explained above. Further, by removing the transformation Eq. 7 we can allow for continous mixtures of processes rather than a discretely factorised.

Predictive inference for novel input locations can be done through the posterior of the model. However, during prediction the latent locations are not known and need to be marginalised from the model,

(9) |

When the latent locations are confined to lie on the corners of a simplex the integral in Eq. 9 reduces to a sum over those corners. This means that the posterior becomes a Gaussian mixture where the distribution over the latent locations can be interpreted as mixture coefficients. This distribution is not analytically tractable hence we proceed with an assumption. The predictive uncertainty of each component is comparable. Therefore an expression of the relative certainty can be recovered given a coordinate in the input space,

(10) |

where is the square root of the predictive variance of the model for component . This means that outputs can then be generated from the model at any given input location by first sampling a component according to the probabilities in (10) and then sampling from the drawn component. Examples of samples and are provided in Figures 1, 4 and 5.

### Variational Gaussian Process Latent Variable Model

The GP-LVM framework used throughout this paper, introduced in [damianou2015variational], deploys auxiliary inducing variables as a mean of marginalizing out the input. For our model the input is and , which we jointly denote . The objective function is specified as a lower bound on the data evidence. In evaluating the lower bound, the following expectations, referred to as sufficient statistics, need to be evaluated:

(11) |

where and and are variational parameters. These expectations are only analytically tractable for some convariance kernels, e.g. the linear and the squared exponential. Thus, the variational framework remains limited to the class of kernels where these expectations are tractable, significantly suppressing its modelling power. Critically, the class of kernels where these expectations are analytically tractable does not include our juxtaposition kernel.

### Stochastic Approximations of Expectations

For many kernels, including ours, evaluating these expectations is not tractable; to make progress, we will proceed with a Monte Carlo approach. Besides enabling the variational framework to be used with a vastly larger set of kernels, it is easy to implement, computationally fast and, as we will show, capable of yielding virtually equivalent lower bounds. Below we provide an intuition for why this is.

Within the Variational GP-LVM framework [Titsias:2010tb, damianou2015variational], the form of the approximate posterior over the latent space is selected to be a known parametric distribution (Gaussian) and each observation’s input location is parameterized with an egocentric, independent Gaussian for every input dimension. The expectation approximation accuracy over each entry in is thus independent of dimensionality. In addition the expectations of and are aggregates over , which further reduces the approximation error. Further, when used as part of an iterative optimization, each expectation approximation error over the course of the optimization procedure is independent. Thus the approximation error of the expectations results in a ‘noisy’ gradient which is correct on average (cf. stochastic gradient descent). In summary, the empirical means can be expressed as,

(12) |

where and are obtained using .

#### Implementation

We implement the stochastic approximation using the ‘reparameterization trick’ as discussed in [Kingma:2014] and [DBLP:conf/nips/KingmaMRW14] to ensure we obtain low variance estimates for the expectations. The entire architecture is implemented using the Tensorflow framework [Tensorflow]. This allows us to propagate gradients through the sampling procedure as if they were analytically calculated.

^{inline,color=purple}

^{inline,color=purple}todo: inline,color=purpleNC: Added the implementation paragraph above - hope it makes sense?

We will now proceed with the experiments where we provide empirical evidence that the number of samples can be small enough to be practical to compute while still obtaining accurate enough approximations of the expectations.

## Experiments

Here, we demonstrate the suitability of Monte Carlo methods for approximating the expectations of (11) as in (12). We show this with respect to the effect on the resulting lower bound. In all experiments in this section the squared exponential kernel is used to provide a comparison to the analytical expectation for a common case. Throughout the rest of the paper, the juxtaposition kernel (8) used is analytically intractable and we rely entirely on the sampling method.

#### Lower Bound

In Fig. 3, the log lower bounds obtained by using the exact analytical expectations are compared to using Monte Carlo sampling. As can be seen, the lower bound obtained using approximations with 10 or more samples per iteration follows the one obtained by the exact analytical expectation closely. Furthermore, even using just one sample to approximate the expectation the lower bound converges to a value close to the analytical. This is in agreement with the findings in [Kingma:2014] and [DBLP:conf/nips/KingmaMRW14], and makes intuitive sense since the approximation error at every iteration is independent of the error at other iterations and has zero mean; resulting in a ‘noisy’ gradient of the cost function which is correct on average.

#### Complexity

In terms of computation time, there is an important difference between the sampling method we deploy and the use of analytic expectations. The complexity of the analytic expectation, and its derivatives, can increase greatly, even to the point of making it intractable, to that of directly evaluating the covariance matrix that is needed, regardless, for both methods. This issue is magnified particularly in the case of compound kernels.

In the best case, when using the analytical expectation the time complexity remains unchanged. In the worst case, using e.g. a simple linear kernel, the sampling method would only increase the complexity by a constant factor (evaluating the covariance matrix times) assuming that the expectations are the computational bottleneck. However, the covariance matrix evaluations can be run in parallel. If the constant can be kept low without negatively impacting the obtained lower bound, this can lead to substantial speed-ups (depending on kernel and data size). In our specific environment the total computation time of the lower bound using Monte Carlo sampling with was less than half compared to using the analytical expectation for the squared exponential kernel (and twice with ). Throughout the rest of the paper is used for the analytically intractable presented kernel.

### Juxtaposition kernel

#### Multi-Modality

The ability of the model to disambiguate distinct modes in the output space, at the same locations in the input space, is illustrated in Fig. 1. Since the two partitions of the observations are conditionally independent they can be explained by different covariance functions and a significantly better data fit can be obtained. Given the regularizing properties of the used variational framework, balance between model complexity and data fit is recovered automatically. The data in the example cannot be represented by a Gaussian likelihood satisfactory; this forces a standard (‘single-process’) GP to explain the data using a high noise variance. The result is poorly explained observations and a model with low predictive power. By allowing observations to become conditionally independent of observations close in the input space, via the latent space, a more probable and useful explanation of the data is obtained. A real world example of this is illustrated in Fig. 2, where joint positions are disambiguated in motion data.

Posterior predictions of mean and variance are provided by the individual components allowing for prediction in the input space. We compare the results with those of a normal covariance that is forced to explain all of the variations in the signal as noise. Another view-point is to think of our model as a means to cast a multi-modal problem as a regression problem. An extreme case of this is shown in Fig. 5 where the letter S is decomposed into three different functions. Importantly, we can predict three different outputs from a single input space, in effect we have decomposed this multimodal regression problem into a regression problem with one free latent variable that differentiates between the modes.

#### Non-Stationarity

An example of modelling single-modal non-stationarity in the form of heteroscedastic noise is found in Fig. 4. Here the factorisation kernel is used as a means of forming a GP prior where individual observations are explained by either a noiseless or noisy smooth function. In this model, the covariance of observations are dependent on the local properties of the data partitioning in the input space (such as relative density) in addition to the hyper parameters governing their respective components covariance function. As a result, the compound covariance function can model local noise characteristics of the data non-parametrically. By comparison, the standard squared exponential covariance overestimates the variance in the noiseless region while underestimating (being overconfident) in the noisy regions.

### Jura Geostatistics

In Fig. 5(a) the effects of the properties of the presented kernel is illustrated on a real world data set comprised of measured element concentrations throughout the geographical region of Jura, Switzerland [goovaerts1997geostatistics]. The geographical distance between neighbouring data points in the data set are around a kilometer. We believe it is fair to assume that the concentrations within the area around each data point are ‘sporadically mixed’ rather than ‘homogeneously blended’. In other words, the soil within an area is not necessarily blended such that individual samples of it contain a representative concentration for the region. An analogy would be trying to locate a suitable location for a gold mine; a single measurement of gold concentration within a square kilometer, even in the most gold rich areas, can result in both a low concentration and ‘hitting the motherlode’. We model this as that any individual measurement (or sample) within a given area is drawn from any of probability distributions. The probability for a given distribution depends on the location, which we model as . For the given data, we make the conservative assumption that and that independently each of the two generating processes are stationary and smoothly varying; this we model by parameterizing the factorizing kernel using squared exponential kernels. As can be seen in the respective figures, the non-stationary process is not forced to explain the local high concentrations as high levels of global noise but can capture the multi-modal smoothly varying trends of the data set.

## Conclusion

We have presented Latent Gaussian Process Regression, a natural extension to GP regression that allows modelling of non-stationary multi-modal processes using a simple combination of any covariance functions. We show how our method can be used to factorise a signal into several different processes which allows modelling of multi-modal data. We also show how non-stationary single-modal data can be modelled using the same approach. Our approach builds on a latent variable extension of the input domain which is approximatively marginalised out. We provide empirical evidence which highlights that a simple sampling based approach can be used to replace expensive, and sometimes intractable, expectations in the traditional variational formulation of GP-LVMs. In its current form the number of components is a free parameter which we, in later work, hope to directly infer from data.